Unikernels, Software Containers and Serverless Architecture: Road to Modularity

This blog post discusses the implications of Unikernels, Software Containers and Serverless Architecture for the modularity of complex software systems in a service mesh, as illustrated below. Modular software systems are claimed to be more maintainable, secure and future-proof than software monoliths.

[Figure: StackedModules]

Software containers, or alternatively MicroVMs, have proven very successful for realizing extremely scalable cloud services. Examples can be found in the areas of serverless computing and Big Data / NoSQL solutions in the form of serverless databases (which are often not realized using containers). This has gone so far that, upon a web request of a user, a software container is started that executes a business function developed by a software engineer, an answer is provided to the user and then the software container is stopped. Thus, large cost savings can be achieved in a cloud world where infrastructure and services are paid by actual consumption.

However, we will see in this post that there is still room for optimization (cost/performance) in modularizing the application, which is usually still based on a large monolith, such as the Java Virtual Machine with all standard libraries or the Python environment with many libraries that are in most cases not used at all to execute a single business function. Furthermore, the operating system layer of the container is also not optimized towards the execution of a single business function, as it contains much more operating system functionality than needed (e.g. drivers, file systems, network protocols, system tools etc.). Thus Unikernels are an attractive alternative to introduce cost savings in the cloud infrastructure.

Finally, we will discuss grouping of functions, i.e. where it makes sense to combine a set of functions of your application, composed of single functions/microservices, into one unit. We will also briefly address composable infrastructure.

Background: Software Containers and Orchestrators

The example above is of course simplistic and much more happens behind the scenes. For example, the business function may need to fetch data from a datastore. It may need to communicate with other business functions to return an answer. This requires that these business functions, the communication infrastructure and the datastores work together, i.e. they need to be orchestrated. Potentially additional hardware (e.g. GPUs) needs to be taken into account that is not available all the time due to cost.

This may imply, for example, that these elements run together in the same virtual network or should run on the same servers or on servers close to each other for optimal response times. Furthermore, in case of failures, requests need to be rerouted to working instances of the business function, the communication infrastructure or the datastore.

Here orchestrators for containers come into play, for example Kubernetes (K8S) or Apache Mesos. In reality, much more needs to be provided, e.g. distributed configuration services, such as etcd or Apache Zookeeper, so that every component always finds its configuration without relying on complicated deployment of local configuration files.

Docker has been a popular implementation of software containers, but it was neither the first one nor was it based on new technologies. In fact, the underlying kernel feature (cgroups) had been introduced into the Linux kernel years before Docker emerged.
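
As a minimal sketch of how container runtimes build on cgroups, the following hypothetical Python snippet reads the memory limit the kernel enforces for the current cgroup; it assumes a Linux host with cgroup v2 mounted at /sys/fs/cgroup (cgroup v1 uses different file names), so treat the paths as assumptions rather than a universal recipe.

    from pathlib import Path

    def read_memory_limit(cgroup_root="/sys/fs/cgroup"):
        # cgroup v2 exposes the limit in memory.max; "max" means unlimited.
        # cgroup v1 systems use memory/memory.limit_in_bytes instead.
        limit_file = Path(cgroup_root) / "memory.max"
        try:
            return limit_file.read_text().strip()
        except OSError:
            return "unknown (cgroup files not accessible on this system)"

    if __name__ == "__main__":
        print("memory limit enforced by the kernel:", read_memory_limit())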

This concept has been extended by so-called MicroVM technologies, such as Firecracker, that draw on UniKernel ideas to provide only the OS functionality needed. This increases reliability, security and efficiency significantly. Such VMs can start up much faster, e.g. within milliseconds, compared to Docker containers and are thus better suited even for the simple web service request use cases described above.

About UniKernels

UniKernels (also known as library operating systems) are a core concept behind modern MicroVM technologies, such as Firecracker, and popular for providing cloud services. They contain only the minimum set of operating system functionality necessary to run a business function. This makes them more secure, reliable and efficient, with significantly better reaction times. Nevertheless, they are still flexible enough to run a wide range of functionality. They thus contain only a minimal kernel and a minimal set of drivers. UniKernels have been proposed for various domains, and despite some successes in running them productively they are still a niche at the moment. Examples are:

  • ClickOS: dynamically create new network devices/functions (switching, routing etc.) within milliseconds on a device, potentially based on a software-defined network infrastructure
  • Runtime.js: A minimal kernel for running Javascript functions
  • L4 family of microkernels
  • UniK – compile applications for use as UniKernels in cloud environments
  • Drawbridge – a Windows-based UniKernel
  • IncludeOS – a minimal library operating system for running C++ services in containers/MicroVMs
  • Container Linux (formerly: CoreOS): a lightweight OS to run containers, such as Docker or rkt. While this approach is very light-weight, it still requires that the containers designed by developers are light-weight, too. Especially, care must be taken that the containers not only include just the necessary libraries, but ideally only the parts of those libraries that are needed, and only one version of them.
  • OSv – run unmodified Linux applications on a UniKernel
  • MirageOS – an OCaml-based UniKernel

Serverless Computing

Serverless computing builds on MicroVMs and Unikernels. Compared to traditional containerization approaches this significantly reduces resource usage and maintenance cost. On top, they provide a minimum set of libraries and engines (e.g. Java, Python etc.) to run a business function with, ideally, the minimum needed set of functionality (software/hardware). Examples of implementations of serverless computing are OpenFaaS, Kubeless or OpenWhisk. Furthermore, all popular cloud providers offer serverless computing, such as AWS Lambda, Azure Functions or Google Cloud Functions.

The advantage of serverless computing is that one ideally does not have to design and manage a complex stack of operating systems, libraries etc., but simply specifies a business function to be executed. This significantly reduces the operating costs for the business function, as server maintenance, operating system maintenance and library maintenance are taken over by the serverless computing platform. Furthermore, the developer may specify required underlying platform versions and libraries. While those are usually offered by the service provider out of the box, otherwise they have to be created manually by the provider or by the developer of the business function.
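
To illustrate how little a developer has to provide, a serverless business function is typically just a handler. The following sketch follows the AWS Lambda Python handler convention; the event fields and the discount logic are made up for illustration.

    import json

    def lambda_handler(event, context):
        # "event" carries the request payload, "context" runtime information.
        # The business function itself is a few lines; everything below the
        # handler (runtime, OS, scaling) is managed by the platform.
        customer_id = event.get("customer_id", "unknown")
        result = {"customer_id": customer_id, "discount": 0.1}
        return {"statusCode": 200, "body": json.dumps(result)}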

The libraries that provide the foundation for a business function should ideally be modularizable. For example, for a given business function one does not need all the functionality of a Java Virtual Machine (JVM) including standard libraries. However, only recently has Java introduced a possibility to modularize the JVM using the Jigsaw module system that came with JDK 9. This is already an improvement in efficiency when using serverless computing, but the resulting modules are still comparably coarse-grained. For example, it is at the moment not possible to hand the Java compiler a given business function and have it strip the standard and third-party libraries down to only the functionality needed. This still depends highly on the developer, and there are some limits. For other libraries/engines, such as Python, the situation is worse.

The popular GNU C standard library (glibc) is also a big monolithic library, used by Java, Python and native applications, that has a lot of functionality that is never used by a single business function or even application. Here alternatives exist, such as musl.

This means that currently perfect modularization cannot be achieved in serverless computing due to the lack of support by underlying libraries, but it is improving continuously.

Service Mesh

A service mesh is a popular means for communication to and between functions in serverless computing. Examples of service mesh technologies are Istio, Linkerd or Consul Connect. Mostly this refers to direct synchronous communication; asynchronous communication, which is an important pattern for calling remote functions that take a long time to complete, such as certain machine learning training and prediction tasks, is usually not supported directly.

However, you can deploy any messaging solution, such as ZeroMQ or RabbitMQ, to realize asynchronous communication.
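
As a minimal sketch of such asynchronous communication, the following uses the pika client for RabbitMQ to hand off a long-running task to a queue instead of calling the function synchronously; the broker address, queue name and message payload are assumptions for illustration.

    import json
    import pika

    # Connect to a RabbitMQ broker assumed to run on localhost.
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="training-jobs", durable=True)

    # Publish a message describing a long-running task (e.g. a model training job);
    # a worker function consumes it whenever it is ready.
    channel.basic_publish(
        exchange="",
        routing_key="training-jobs",
        body=json.dumps({"model": "churn", "dataset": "s3://bucket/data.csv"}),
    )
    connection.close()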

The main point here is that service meshes and messaging solutions can benefit a lot from modularization. In fact, the aforementioned ClickOS is used in network devices to rapidly spawn any network function you may need from the network device, such as routing, firewalling or proxying, as a container. Those concepts can be transferred to service meshes and messaging solutions to deliver only the minimal functionality needed for secure communication between serverless computing functions.

The modularization of the user interface

One issue with user interfaces is that they basically provide a sensible composition of business functions that can be triggered from them. This means they support a more or less complex business process that is executed by one or more humans. In order to be usable they should present a common view on the offered functionality to the human users. New technologies, such as Angular Ivy, support extracting from a UI library only the needed functionality, reducing code size and improving security and reliability of the UI.

The aforementioned definition of UI means that there is at least one monolith that combines all the user interfaces related to the business functions in a single view for a given group of users. For decades there have been technologies out there that can do this, such as:

  • Portals/Portlets: more structured UI aggregation already at the server side
  • Mashups: loose coupling of UIs using various “Web 2.0” technologies, such as REST, Websockets, JSON and Javascript, and integrating content from many different services

One disadvantage with those technologies is that a developer needs to combine different business functions into a single UI. However, the user may not need all the functionality of the UI and it cannot be expected that a developer manually combines all UIs of business functions for different user groups.

It would be more interesting if UIs were combined dynamically given the user context (e.g. desk clerk vs. flight attendant) using artificial intelligence technologies. Such approaches have existed in academia for many years, but have not yet made it into large-scale production environments.

Finally, one needs to think about distributed security technologies, such as OpenID Connect, to provide proper authentication and authorization for access to those UI combinations.

Bringing it all together: Cloud Business Functions and Orchestration

With the emergence of serverless computing, microservices and container-based technologies we have seen a trend towards more modularization of software. The key benefits are higher flexibility, higher security and simpler maintenance.

One issue related to this is how to include only the minimal set of software needed to run a given business function. We have seen that this is not so easy and currently one still has to include large monolithic libraries and runtimes, such as glibc, Python or Java, to run a single business function. This increases the risk of security issues and still requires big upgrades (e.g. moving to another major version of an underlying library). Additionally, the underlying operating system layer is also far from being highly modularizable. Some such operating systems exist, but they remain mostly in the domain of highly specialized devices.

Another open question is how to deal with the feature interaction problem, as the large number of possible combinations of modules may have unforeseen side-effects. On the other hand, one may argue that higher modularization and isolation will make this less of a problem. However, those aspects still have to be studied.

Finally, let us assume several business functions need to interact with each other (Combined Business Functions – CBF). One could argue that they could share the same set of modules and versions that they need. This would reduce complexity, but it is not always easy in serverless computing, where it is quite common that a set of functions is developed by different organisations. Hence, they may have different versions of a shared module. This might not be so problematic if the underlying function has not changed between versions. However, if it has changed it can lead to subtle errors when two business functions in serverless computing need to communicate. Here it would be desirable to capture those changes semantically, e.g. using some logic language, to automatically find issues or potentially resolve them in the service mesh / messaging bus layer. One may also think in this context that if business functions run on the same node they could potentially share modules to reduce the memory footprint and potentially CPU resources.

Answers to those issues will also make it easier to upgrade serverless computing functions to the newest version offering the latest fixes.

In the future, I expect:

  • CBF analyzers that automatically derive and extract the minimum set of VM, unikernel, driver and library/engine functionality needed to run a business function or a collection of loosely coupled business functions
  • Extended analysis for colocating CBFs that have the optimal minimum set of joint underlying dependencies (e.g. kernel, drivers etc.)
  • Dynamically making shared underlying modules in native libraries and operating system code available at runtime to reduce resource utilization
  • Composable infrastructure and software-defined infrastructure will not only modularize the underlying software infrastructure, but the hardware itself. For instance, if only a specific function of a CPU is needed then other parts of the CPU can be used by other functions (e.g. similar to Hyper-Threading). Another example is the availability and sharing of GPUs by plugging them anywhere into the data center.

 


GPUs, FPGAs, TPUs for Accelerating Intelligent Applications

Intelligent applications are part of our everyday life. One observes a constant flow of new algorithms, models and machine learning applications. Some require ingesting a lot of data, some require applying a lot of compute resources and some address real-time learning. Dedicated hardware capabilities can thus support some of those, but not all. Many mobile and cloud devices already have hardware-accelerated support for intelligent applications and offer intelligent services out of the box.

Generally, the following categories of hardware support for intelligent applications can be distinguished:

  • Graphics Processing Units (GPU)

  • Field Programmable Gate Arrays (FPGA)

  • Tensor Processing Units (TPU)

GPUs were among the first pieces of specialized hardware supporting intelligent applications. That may not have been so obvious, because their primary focus has been accelerating the 3D rendering of games. In fact, originally only a small part of the GPU rendering capabilities was needed for machine learning, but the computation still had to go through all rendering stages. That changed with new GPU architectures allowing more flexible pipelines. Compared to CPUs or FPGAs they are only good at a few specific tasks and you need to buy new hardware to evolve with them. GPUs can be seen as a variant of ASICs (see TPUs below).

FPGAs are highly specialized chipsets. Their advantage is that much of the programming is encoded in the hardware in the form of reusable blocks that can be arbitrarily combined in software. While most of those reusable blocks are configurable to represent logic functions, they can also be specialized hardware providing memory, network interconnection and so on. Despite their flexibility they can offer very good performance, but they are more difficult to produce and consume a lot of energy. They are suitable for machine learning algorithms that currently cannot be accelerated by GPUs/TPUs and for innovative algorithms where the complete potential of hardware acceleration is not yet known. FPGAs would allow upgrading the hardware acceleration for those via a “simple” software update.

TPUs were initially proposed by Google and are based on application-specific integrated circuits (ASIC); see here for an overview. Hence, they have similarities with GPUs. However, contrary to GPUs, all functionality not relevant for specific machine learning algorithms has been discarded. For example, they have less precision and more compact data types.

A more detailed description of the difference between ASICs and FPGAs can be found here.

Nowadays I see more and more that specialized hardware can be combined into clusters (e.g. “GPU cluster”) to offer even more compute power.

Hardware support can significantly increase the performance of intelligent applications, but usually only for a small part of them. The loading of data, the efficient representation of data in memory, the search within the data, the selection of the right subset of data and the transformation cannot be accelerated in most cases and require careful design of proper distributed systems using NoSQL databases and Big Data platforms. Hence, one should see those specialized solutions in the context of a larger architecture rather than in isolation.

What is the secret of specialized hardware?

Specialized hardware for intelligent applications speeds up a compute-intensive machine learning algorithm by implementing part of the algorithm, such as matrix multiplication, directly in hardware and at the same time using more efficient non-standard data types for the calculation. For instance, TPUs use non-standard narrow integers instead of floating point numbers. They require less space and can be calculated faster in hardware, but are less precise. However, for most intelligent applications the extra precision of other data types, which are much less efficient to calculate, does not matter. FPGAs work in a similar way, but provide some flexibility to reprogram the data type or which part of the algorithm is hardware-optimized. GPUs are more generic and were originally designed for other purposes. Hence, they are not as fast, but they offer more flexibility, are more precise and are sufficient for most machine learning problems.
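
A small worked example with NumPy illustrates the trade-off: quantizing 32-bit floats to 8-bit integers saves three quarters of the memory and enables faster hardware arithmetic, at the cost of a small rounding error that is tolerable for most intelligent applications. The linear scaling scheme below is a simplified sketch, not any particular TPU's implementation.

    import numpy as np

    weights = np.random.uniform(-1.0, 1.0, size=1000).astype(np.float32)

    # Map the float range onto signed 8-bit integers using a simple linear scale.
    scale = np.abs(weights).max() / 127.0
    quantized = np.round(weights / scale).astype(np.int8)

    # Dequantize to compare against the original values.
    restored = quantized.astype(np.float32) * scale
    print("memory (float32):", weights.nbytes, "bytes")
    print("memory (int8):   ", quantized.nbytes, "bytes")
    print("max rounding error:", np.abs(weights - restored).max())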

Specialized hardware can optimize the training part of intelligent applications but, more importantly, it can optimize the prediction part, which is usually more compute-intensive overall because it has to be applied in production to the whole dataset.

Who uses this specialized hardware?

Until some years ago specialized hardware was used by large Internet companies whose business model was to process large amounts of their users' data in an intelligent way to make revenue out of it. Nowadays many more companies are leveraging specialized hardware – be it in the cloud or on-premise. Most of them work with video/image/audio streams and try to address use cases such as autonomous cars, image recognition and voice recognition.

Can it be made available in your local data center?

Specialized hardware can be deployed in your data center. Most data center hardware vendors include a GPU in their offering that you can procure as an extra component to be made available in virtualized form for your applications. TPUs and FPGAs are more uncommon: the former are usually custom-made and not many are currently available for normal enterprise data centers. FPGAs are more readily available for the data center, but require a lot of custom programming/adaptation which most enterprises won't do. Furthermore, they might not be supported by popular machine learning libraries.

However, if you want to provide them in your data center then you also need to think about backup & restore and, more importantly, disaster recovery scenarios covering several data centers. This is not a trivial task, especially with contemporary machine learning libraries and the way they are leveraged when developing intelligent applications.

What is available in the cloud?

All popular cloud providers offer at least GPU support. Some go beyond and offer TPU or FPGA support:

  • Amazon AWS currently supports general GPU virtual instances (P3/P2), GPUs for graphics processing (G3) and FPGA instances (F1). Some of the instance types support GPU clusters for very compute-intensive intelligent applications. You can also dynamically assign GPUs to instances only for the time you need them to save costs. The AWS Deep Learning AMIs (virtual images) have common machine learning toolkits preconfigured and installed that integrate with the offered hardware accelerators. Many other services exist that offer specific functionality (image recognition, translation, voice recognition, time series forecasting) with the acceleration built in, without the consumer of those services even realizing it.

  • Microsoft Azure has a similar offering. There you find as well dedicated instances for accelerating machine learning and graphics processing. Special images come preconfigured with the operating system and popular machine learning libraries. Brainwave offers the integration of FPGAs into your intelligent application. Cognitive Services offer intelligent functionality, such as image recognition, voice recognition, search etc.

  • Google Cloud platform has a similar offering to the ones above.

As you can see, the cloud can even free you from extremely costly tasks such as developing machine learning models, training them, preparing training data, evaluating them and, most costly of all, making them production-ready. You can simply consume intelligent services without a heavy upfront investment. For many enterprises this model makes the most sense.

What about software support?

Many libraries for intelligent applications support GPUs and even have specialized versions from the cloud providers or hardware vendors that integrate more “exotic” support (e.g. TPU or FPGA). The libraries are also starting to support clustering of GPUs and other specialized hardware, i.e. the combination of several GPUs to provide more compute power for an intelligent application. However, this is complex, and one should be careful not to work at too low a level, because you will spend more time on optimizing the code than on the business problem. It is also not likely that a normal enterprise can optimize better than the existing libraries provided by cloud providers and hardware vendors. This means that many applications should not use low-level libraries, such as Nvidia CUDA or, to some extent, TensorFlow, directly. Those are for very specialized companies or very specialized problems that require new types of models or data handling far beyond the existing models provided by higher-level frameworks. Instead most enterprises will rely on higher-level libraries, such as Keras, MXNet, DeepLearning4j, or on cloud services as described before. The reason is that leveraging the specialized hardware and combining it with an algorithm requires a lot of know-how to run it properly and efficiently. For instance, you can also see from the TPU description above that some properties (e.g. accuracy) are restricted in the hardware and the algorithms have to be adapted to this.

How to deal with costs?

One side note on costs for on-premise specialized hardware. Specialized hardware has very fast iterations – every few months a new, better version is offered. That does not mean that you need to buy new hardware every few months. Instead you can think about renting hardware for your datacenter, combining several specialized hardware items into one (“GPU cluster”) and/or hybrid cloud deployments. Furthermore, if your intelligent application is very “hardware hungry”, in a lot of cases a better design of the application can bring you more intelligence and higher performance.

The cloud offers various cost models, for instance, the most prominent one is hourly billing of GPU time.

What can I not do?

Not all machine learning algorithms benefit from specialized hardware. Specialized hardware is basically very good at everything that relies on matrix operations. Those are the foundation of many machine learning algorithms, but not all. This applies to training, but more importantly also to prediction. The other category (e.g. variants of decision trees or certain graph algorithms) is nonetheless needed and usually performs well without hardware acceleration. Furthermore, specialised hardware will help you little with some of the most important steps, e.g. many types of feature extraction, cleaning of the data etc.

Conclusion

Keep in mind that there is a lot to optimize in an intelligent application. The key is to understand the underlying business problems and not “fall in love” with specific hardware or algorithms. Not all intelligent algorithms can be sped up with specialized hardware, but they are nevertheless useful in many scenarios.

In some cases a different application design can bring significant performance and accuracy gains without relying on specialised hardware, for instance by leveraging NoSQL databases.

In other cases, a better understanding of the underlying business problem to solve will deliver insights on how to improve an intelligent application.

However, there are use cases for specialised hardware for training of certain intelligent applications, but more importantly for prediction, which needs to be applied on all items of a dataset in production and is usually very compute intensive.

Enterprises should carefully evaluate whether they want to enter the adventure of machine learning and even integrate it with specialised hardware. Nowadays cloud services offer intelligent services that can simply be consumed and that have been trained by experts. Replicating this in one's own enterprise requires a heavy up-front investment. Especially if you are just starting the journey towards intelligent applications, it is recommended that you first check the cloud offerings and experiment with them. The intelligent services offered by cloud providers should serve as benchmarks for internal endeavors to determine if they can compete and are worth the investment.

In the future I see a multichip world where an intelligent application leverages many different chips (GPUs, FPGAs, TPUs) at the same time in an intelligent cluster. Even each dataset and algorithm can have a custom-developed chip. Those ensembles will be the next steps towards more sophisticated intelligent applications and especially cloud services ready for consumption.

Collaborative Data Science: About Storing, Reusing, Composing and Deploying Machine Learning Models

Why is this important?

Machine learning has re-emerged in recent years as new Big Data platforms provide means to use it with more data, make models more complex and allow combining several models for an even more intelligent predictive/prescriptive analysis. This requires storing as well as exchanging machine learning models to enable collaboration between data scientists and applications in various environments. In the following paragraphs I will present the context of storing and deploying machine learning models, describe the dimensions along which model storage and deployment frameworks can be characterized, classify existing frameworks in this context and conclude with recommendations.

Context

Machine learning models usually describe mathematical equations with special parameters, e.g.

y = a*x + b, with y as the output value, x as the input value and a, b as parameters.

The values of those parameters are usually calculated using an algorithm that takes training data as input. Based on the training data the parameters are calculated to fit the mathematical equation to the data. Then, one can provide an observation to the model and it predicts the output related to the observation. For instance, given a customer with certain attributes (age, gender etc.) it can predict if the customer will buy the product on the web page.
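
A minimal scikit-learn sketch of this fit-then-predict workflow; the data below is made up for illustration.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Training data: x as input, y roughly following y = 2*x + 1 with some noise.
    x_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
    y_train = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

    model = LinearRegression()
    model.fit(x_train, y_train)          # estimates the parameters a and b
    print("a =", model.coef_[0], "b =", model.intercept_)

    # Predict the output for a new observation.
    print("prediction for x=6:", model.predict(np.array([[6.0]]))[0])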

At the same time, as machine learning models grew more complex, they were used by multiple people or even developed jointly as part of large machine learning pipelines – a phenomenon commonly known as data science.

This is a paradigm shift from earlier days where everyone mostly worked in isolation and usually one person had a good idea what an analysis was about.

While it is already a challenge to train and evaluate a machine learning model, there are also other difficult tasks to consider given this context:

  • Loading/Storing/Composing different models in an appropriate format for efficient usage by different people and applications

  • Reusing models created in one platform on another platform with a different technology and/or capacity considerations in terms of hardware resources

  • Exchanging models between different computing environments within one enterprise, e.g. to promote models from development to production without the need to deploy potentially risky code in production

  • Discussing and evaluating different models by other people

  • Offering pre-trained models in marketplaces so enterprises can take/buy them and integrate them together with other prediction models in their learning pipeline

Ultimately, there is a need to share those models with different people and embed them in complex machine learning pipelines.

Achieving those tasks is critical to understand how machine learning models evolve and use the latest technologies to gain superior competitive advantages.

We describe the challenges in more detail and then follow up with how technologies, such as PMML or software containers, can address them as well as where they are limited.

Why are formats for machine learning models difficult?

  • Variety of different types of models, such as discriminative and generative, that can be stored. Examples are linear regression, logistic regression, support vector machines, neural networks, hidden Markov models, regenerative processes and many more
  • An unambiguous definition of metadata related to models, such as type of model, parameters, parameter ontologies, structures, input/output ontologies, input data types, output data types, fitness/quality of the trained model and calculations/mathematical equations, needs to be taken into account

  • Some models are very large with potentially millions/billions of features. This is not only a calculation problem for prediction, but also demands answers on how such models should be stored for most efficient access.

  • Online machine learning, ie machine learning models that are retrained regularly, may need additional meta-data definitions, such as which data has been applied to them when, what data should be applied to them from the past, if any, and how frequently they should be updated

  • Exchange of models between different programming languages and systems is needed to evolve them to the newest technology

  • Some special kinds of learning models, e.g. those based on graph models, might have a less efficient matrix representation and a more efficient one based on lists. Although there are compression algorithms for sparse matrices, they might not be as efficient for certain algorithms as lists

  • Models should be easy to version

  • Execution should be as efficient as possible

Generic Ways on Managing Machine Learning Models

We distinguish storage approaches for machine learning models across the following dimensions:

– Low ambiguity / high ambiguity

– Low flexibility / high flexibility

Ideally, a model has low ambiguity and high flexibility. It is very clear (low ambiguity) what the model articulates, so it can be easily shared, reused, understood and integrated (possibly automatically) in complex machine learning pipelines. High ambiguity corresponds to a black-box approach: some code is implemented, but nobody knows what it does or what the underlying scientific/domain/mathematical/training assumptions and limitations are. This makes those models basically useless, because you do not know their impact on your business processes.

Furthermore, one would ideally be able to articulate all possible models of any size, now as well as in the future, which corresponds to high flexibility.

Obviously, one may think that low ambiguity and high flexibility is the ideal storage format. However, this also introduces complexity and a much higher effort to master it. In the end it always depends on the use case and the people as well as the applications working with the model.

In the following diagram you see how different model storage formats could be categorized across different dimensions.

[Figure: ML model storage formats – ambiguity vs. flexibility]

In the following we describe in more detail what these storage formats are and how I came up with the categorization:

CSV (Comma-Separated Values) and other tabular formats (e.g. ORC, Parquet, Avro):

Most analytical tools allow storing machine learning models in CSV or other tabular formats. Although many analytical tools can process CSV files, the CSV and other tabular formats do not adhere to a standard on how columns (parameters of the model) should be named, how data types (e.g. doubles) are represented, and there is no standard on how metadata should be described. The format itself also does not describe how it can be loaded/processed or any computations to be performed. In virtually all cases the CSV format requires each tool to implement a custom ETL process to use it as a model when loading/storing it. Hence, I decided it is low flexibility, because any form of computation is defined outside the CSV or other tabular format. One advantage with respect to flexibility is that with CSV, and much more so with specialized tabular formats (ORC, Parquet etc.), one can usually store very large models. In conclusion it is categorized as high ambiguity and low flexibility.
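
To make the ambiguity concrete, the trained parameters of a simple linear model y = a*x + b could be stored as follows; the column names, types and meaning are pure convention between producer and consumer (the names below are an assumption, not a standard).

    import csv

    # Store the parameters a and b of y = a*x + b; nothing in the file says
    # what kind of model this is, how to apply it or how it was trained.
    with open("linear_model.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["parameter", "value"])
        writer.writerow(["a", 2.0])
        writer.writerow(["b", 1.0])

    # Every consuming tool needs its own ETL code to turn this back into a model.
    with open("linear_model.csv", newline="") as f:
        params = {row["parameter"]: float(row["value"]) for row in csv.DictReader(f)}
    print(params)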

PMML (Predictive Model Markup Language):

PMML has existed since 1997 and is supported by many commercial and open source tools (Apache Flink, Apache Spark, KNIME, TIBCO Spotfire, SAS Enterprise Miner, SPSS Clementine, SAP HANA). PMML is based on XML (eXtensible Markup Language) and is articulated as an XML Schema. Hence, it significantly reduces ambiguity by providing a meta model for how transformations and models are described. Although this meta model is very rich, it includes only a subset of algorithms (many popular ones though) and it cannot easily be extended with new transformations or models that are then automatically understood by all tools. Furthermore, the meta model does not allow articulating on which data the model was trained or on which ontology/concepts the input and output data are based. The possible transformations and articulated models do make it more flexible than pure tabular formats, but since it is based on XML it is not suitable for very large models containing a lot of features.
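
A hedged sketch of exporting a scikit-learn model to PMML using the sklearn2pmml package, one of several available converters; the package must be installed separately and relies on a Java runtime for the conversion, and the data is made up.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn2pmml import sklearn2pmml
    from sklearn2pmml.pipeline import PMMLPipeline

    x_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
    y_train = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

    # Wrapping the estimator in a PMMLPipeline lets preprocessing and model
    # be exported together into a single PMML document.
    pipeline = PMMLPipeline([("regressor", LinearRegression())])
    pipeline.fit(x_train, y_train)

    # Writes an XML document that PMML-aware tools can load for scoring.
    sklearn2pmml(pipeline, "linear_model.pmml")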

PFA (Portable Format for Analytics):

PFA is a more recent storage format than PMML. Contrary to PMML, it includes design considerations for “Big Data” volumes by taking Big Data platforms into account. Its main purpose is to exchange, store and deploy statistical models developed on one platform on another platform. For instance, one may write a trained model in Python and use it for predictions in a Java application running on Hadoop. Another example is that a developer trains the model in Python in the development environment and stores it in PFA to deploy it securely in production, where it is run in a security-hardened Python instance. As you see, this is already very close to the use cases described above. Additionally, it takes Big Data aspects into account by storing the model data itself in Avro format. The nice thing is that you can actually develop your code in Python/Java etc. and then let a library convert it to PFA, i.e. you do not need to know the complex and somewhat cumbersome syntax of PFA. As such it provides a lot of means to reduce ambiguity by defining a standard and a large set of conformance checks towards the standard. This means that if someone develops PFA support for a specific platform/library, it can be ensured that it adheres to the standard. However, ambiguity cannot be rated as very low, because it has no standardized means to describe input and output data as part of ontologies or the fitness/underlying training assumptions. PFA supports the definition of a wide range of existing models, but also new ones, by defining actions and control-flow/data-flow operators as well as a memory model. However, it is not as flexible as, for example, developing a new algorithm that specifically takes specific GPU features into account to run most efficiently. Although you can define such an algorithm in PFA, the libraries used to interpret PFA will not know how to optimize this code for GPUs or distributed GPUs given the PFA model. Nevertheless, for the existing predefined models they can of course derive a version that runs well on GPUs. In total it has low to medium ambiguity and medium to high flexibility.

ONNX (Open Neural Network Exchange Format):

ONNX is another format for specifying the storage of machine learning models. However, its main focus is neural networks. Furthermore, it has an extension for “classical” machine learning models called ONNX-ML. It supports different frameworks (e.g. Caffe2, PyTorch, Apple CoreML, TensorFlow) and runtimes (e.g. Nvidia, Vespa). It is mostly Python-focused, but some frameworks, such as Caffe2, offer a C++ binding. Storage of ML models is specified in protobuf, which itself already offers wide tool support but is of course not ML-specific. It offers description of metadata related to a model, but in the very generic sense of key-value pairs, which is not suitable for describing ontologies. It allows specifying various operators that are composed by graphs describing the data flow. Data types used as part of input and output specifications are based on protobuf data types. Contrary to PFA, ONNX does not provide a memory model. However, similarly to PFA it does not allow full flexibility, e.g. to write code for GPUs. In total it has low to medium ambiguity and medium to high flexibility, but ambiguity and flexibility are a little lower than for PFA.
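
A hedged sketch of exporting a scikit-learn model to ONNX using the skl2onnx converter; the input name and signature are assumptions chosen for illustration.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from skl2onnx import convert_sklearn
    from skl2onnx.common.data_types import FloatTensorType

    x_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]], dtype=np.float32)
    y_train = np.array([3.1, 4.9, 7.2, 9.0, 11.1], dtype=np.float32)

    model = LinearRegression().fit(x_train, y_train)

    # Declare the input signature; ONNX needs explicit tensor types and shapes.
    initial_types = [("input", FloatTensorType([None, 1]))]
    onnx_model = convert_sklearn(model, initial_types=initial_types)

    # The result is a protobuf message that ONNX runtimes can execute.
    with open("linear_model.onnx", "wb") as f:
        f.write(onnx_model.SerializeToString())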

Keras – HDF5

Keras stores a machine learning model in HDF5, which is a dedicated format for “managing extremely large and complex data collections”. HDF5 itself supports many languages, ranging from Python and C to Java. However, Keras is mostly a Python library. HDF5 claims to be a portable file format suitable for high performance, as it includes special time and storage space optimizations. HDF5 itself is not very well supported by Big Data platforms. However, Keras stores in HDF5 the architecture of the model, the weights of the model, the training configuration and the state of the optimizer, to allow resuming training where it was left off. This means that, contrary to simply using a tabular format as described before, it sets a standard for expressing models in a tabular format. It does not itself store training data or any metadata beyond the previously described items. As such it has medium to high ambiguity. Flexibility is between low and medium, because it can more easily describe models or the state of the optimizer.
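
A minimal sketch of saving and restoring a Keras model in HDF5; the tiny network architecture is arbitrary and chosen only for illustration.

    from tensorflow import keras

    # A tiny network; architecture, weights, training configuration and
    # optimizer state all end up in the single HDF5 file.
    model = keras.Sequential([
        keras.layers.Dense(8, activation="relu", input_shape=(4,)),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.save("model.h5")

    # Reloading restores the model so training or prediction can continue.
    restored = keras.models.load_model("model.h5")
    restored.summary()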

Tensorflow format

Tensorflow has its own format for loading and storing a model, which includes variables, the graph and graph metadata. Tensorflow claims the format is language-neutral and recoverable. However, it is mostly used within the Tensorflow library. It provides only few possibilities to express a model. As such it has medium to high ambiguity. Flexibility is higher than CSV and ranges from low to medium.

Apache Spark Internal format for storing models (pipelines)

Apache Spark offers storing a pipeline (representing a model or a combination of models) in its own serialization format that can only be used within Apache Spark. It is based on a combination of JSON describing metadata of the model/pipeline and Parquet for storing the model data (weights etc.) itself. It is limited to the models available in Apache Spark and cannot easily be extended to additional models (except by extending Apache Spark). As such it ranges between high and medium ambiguity. Flexibility is between low and medium, because it requires Apache Spark to run and it is limited to the models offered by Apache Spark. Clearly, one benefit is that it can store compositions of models.
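
A hedged PySpark sketch of storing and reloading a fitted pipeline in Spark's internal format; the column names, data and output path are made up for illustration.

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline, PipelineModel
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.appName("pipeline-save-demo").getOrCreate()
    df = spark.createDataFrame([(1.0, 3.1), (2.0, 4.9), (3.0, 7.2)], ["x", "y"])

    # Pipeline = feature assembly + model; Spark persists both stages together.
    assembler = VectorAssembler(inputCols=["x"], outputCol="features")
    lr = LinearRegression(featuresCol="features", labelCol="y")
    fitted = Pipeline(stages=[assembler, lr]).fit(df)

    # JSON metadata plus Parquet files with the model weights are written here.
    fitted.write().overwrite().save("/tmp/linear_pipeline")
    reloaded = PipelineModel.load("/tmp/linear_pipeline")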

Theano – Python serialization (Pickle)

Theano offers Python serialization (“Pickle”). This means that nearly everything (with some restrictions) that can be expressed in Python, including its runtime data structures, can be stored/loaded. Python serialization – like any other programming language serialization, such as Java's – is very storage/memory hungry and slow. Additionally, the Keras documentation (see above) does not recommend it. It also has serious security issues when bringing models from development to production (e.g. someone can put anything in there, even things that are not related to machine learning, and exploit security holes with confidential data in production). Furthermore, serialization between different Python versions might be incompatible.

The ambiguity is low to medium, because basically only programming language concepts can be described. Metadata, ontologies etc. cannot be expressed easily and a lot of unnecessary Python-specific information is stored. However, given that it offers the full flexibility of Python, it ranges from medium to high flexibility.
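
A minimal sketch of Python pickling, and the reason behind the security caveat above: loading a pickle executes whatever object-reconstruction logic is embedded in the file, so untrusted model files must never be unpickled in production.

    import pickle
    from sklearn.linear_model import LinearRegression

    model = LinearRegression()

    # Serialize the full Python object graph, including library-specific internals.
    with open("model.pkl", "wb") as f:
        pickle.dump(model, f)

    # Loading reconstructs the objects; a malicious pickle can run arbitrary code
    # at this point, which is why pickles from untrusted sources are dangerous.
    with open("model.pkl", "rb") as f:
        restored = pickle.load(f)
    print(type(restored))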

Software Container

Some data science tools allow defining a model in a so-called software container (e.g. implemented in Docker). These are packages that can be easily deployed and orchestrated. They basically allow containing any tool one wants. This clearly provides huge flexibility to data scientists, but at the cost that the software containers are usually not production-ready, as they are provided by data scientists who do not have the same skills as enterprise software developers. Usually they lack an authorization and access model or any hardening, which makes them less useful for confidential or personal data. Furthermore, if data scientists can install any tool, this leads to a large zoo of different tools and libraries, which are impossible to maintain, upgrade or patch with security fixes. Usually only the data scientist who created them knows the details of how the container and the contained tools are configured, making it difficult for others to reuse it or to scale it to meet new requirements. Containers may contain data, but this is usually not recommended for data that changes (e.g. models etc.). In these cases one needs to link permanent storage to the container. Of course, the model format itself is not predefined – any model format may be used depending on the tools in the container.

As such they do not provide any means to express information about the model, which means they have very high ambiguity. However, they have high flexibility.

Jupyter Notebooks

Jupyter notebooks are basically editable web pages in which the data scientist can write text that describes code (e.g. in Python) that is executable. Once executed, the page is rendered with the results from the executed code. These can be tables, but also graphs. As such, notebooks can support various programming languages or even mix different programming languages. Execution depends on data stored outside the notebook, in any format that is supported by the underlying programming language.

Descriptions can be rich, but they are written in natural language and are thus difficult to process by an application, e.g. to reuse them in another context or to integrate them into a complex machine learning pipeline. Even for other data scientists this can be difficult if the descriptions are not adequate.

Notebooks are better understood in the scientific context, i.e. writing papers and publishing them for review, which does not address all the use cases described above.

As such it provides high flexibility and medium to high ambiguity.

Conclusion

I described in this blog post the importance of the storage format for machine learning models:

  • Bring machine learning models from the data scientist to a production environment in a secure and scalable manner where they are reused by applications and other data scientists

  • Sharing and using machine learning models across systems and organizational boundaries

  • Offering pretrained machine learning models to a wide range of customers

  • (Automatically) composing different models to create a new more powerful combined model

We have seen many different solutions across the dimensions of flexibility and ambiguity. There is not one solution that fits all use cases, which means there is no perfect standard solution. Indeed, an organization will likely employ two or more approaches or even combine them. I see four major directions in the future:

  • Highly standardized formats, such as the portable format for analytics (PFA), that can be used across applications and thus data scientists using them

  • Flexible descriptive formats, such as notebooks, that are used among data scientists

  • A combination of flexible descriptive formats and highly standardized formats, such as using PFA in an application that is visualized in a Notebook at different stages of the machine learning pipeline

  • An extension of existing formats towards online machine learning, i.e. updatable machine learning models in streaming applications

Automated Machine Learning (AutoML) and Big Data Platforms

Although machine learning has existed for decades, the typical data scientist – as you would call it today – would still have to go through a manual, labor-intensive process of extracting the data, cleaning, feature extraction, regularization, training, finding the right model, testing, selecting and deploying it. Furthermore, for most machine learning scenarios you do not use one model/algorithm but evaluate a plethora of different algorithms to find one suitable for the given data and use case. Nowadays a lot of data is available under the so-called Big Data paradigm, introducing the additional challenge of mastering machine learning on distributed computing platforms.

This blog post investigates how to ease the burden on data scientists of manual, labor-intensive model evaluation by presenting insights on the recent concept of automated machine learning (AutoML) and whether it can be adapted to Big Data platforms.

What is AutoML?

Machine learning is about learning and making predictions from data. This can be useful in many contexts, such as autonomous robots, smart homes, agriculture or financial markets. Machine learning – despite the word “machine” in its name – is mostly a manual process that requires a highly skilled person to execute. Although the learning part is rather automated, this person needs to extract data, transform it in potentially different alternative ways, feed it into many alternative machine learning algorithms, use different parameters for the same algorithm, evaluate the quality of the generated prediction model and finally deploy this model for others so that they can make their own predictions without going through this labor-intensive process themselves. This is illustrated in the following figure:
[Figure: AutoML flow diagram]

Given this background, it comes as no surprise that huge marketplaces have been created where people sell and buy trained machine learning models. Examples are the Azure Machine Learning Marketplace, the Amazon AWS Artificial Intelligence marketplace, the Algorithmia Marketplace, the Caffe model zoo, Deeplearning4j models, the MXNet model zoo, TensorFlow models and the Acumos Marketplace (based on the open source software).

Nevertheless, most organizations that want to keep up with the competition in these marketplaces, or that do not want to rely on marketplaces due to unique problems in their organizations, still have the issue of finding skilled persons that create prediction models. Furthermore, machine learning models become so complex that they are not the outcome of a single person, but of a team that needs to ensure consistent quality.

Here AutoML comes into play. It can support highly skilled machine learning persons, but also persons without machine learning skills, to create their own prediction models in a transparent manner. Ideally this is done in an automated fashion for the whole machine learning process, but contemporary technology focuses mostly on the evaluation of different models, a set of suitable parameters for these models (“hyperparameter optimization”) and automated selection of the “winning” (or best) model(s) (this is highlighted in the previous figure in green). In case of several models, a ranking is created on a so-called “leaderboard”. This has to be done within a given time budget, i.e. one has only limited time and resources. Potentially several models could be combined (ensemble learning) in more advanced solutions, but this is currently in its infancy.

Some observations here for this specific focus:

  • AutoML does not invent new models. It relies on an existing repository of algorithms that it iterates over and tries out with different parameters. In itself AutoML can be seen as a “meta-algorithm” over these models.
  • Similarly it relies on an existing repository of tests that determine suitability of a model for a given problem.
  • It requires clever sampling (size of sample, randomness) of the underlying training data, otherwise it may take very long to evaluate the models or simply the wrong data is used. A preselection of the data by a person is still needed, although the person does not require as many machine-learning-specific skills.
  • A person still needs to determine for the winning model if what it predicts makes sense. For this the person does not need machine learning skills, but domain-specific skills. For instance, a financial analyst can determine whether a dataset of financial transaction attributes can predict fraud or not.

Big Data platforms

The emergence of the open source Hadoop platform in 2006 introduced Big Data platforms on commodity hardware and network clusters to the world. A few years later Hadoop was adopted for data analytics in several data analysis organizations. The main focus of Hadoop was to enable analytic tasks on large datasets in a reliable manner that was not possible before. Meanwhile, further improved platforms have been created, such as Apache Flink or Apache Spark, that focus not only on processing large data volumes, but also on processing them faster by employing various optimization techniques.

Those platforms employ several nodes that communicate over a network to execute a task in a distributed manner. This imposes some challenges for machine learning, such as:

  • Standard machine learning algorithms are not designed to be executed over a distributed network of nodes.
  • Not all machine learning algorithms can be converted into a distributed one. For instance, if you need to estimate the parameters of a model then gradient descent might require a lot of memory on a single node. Hence, other estimation methods need to be used to work in parallel on different nodes.

This led to specific machine learning libraries for those platforms, such as Flink ML or Spark MLlib, that support only a dedicated subset of algorithms that can be executed efficiently and effectively over a given set of nodes communicating via the network.

Tool support for AutoML

AutoML can be very useful for your organization. Amongst others the following tools exist.

  • Auto-SKLearn: Automated machine learning toolkit to be used in lieu of the non-automated scikit-learn. The budget can be defined by time and memory as well as search space. Automated preprocessing can be defined. Multicore processing is supported for some of the algorithms. Supported models: multiple classifiers and regressors as well as combinations (ensemble construction). Hyperparameter optimization: Bayesian Optimization.
  • TPOT: Automated machine learning toolkit offering various automated preprocessors (e.g. Standard Scaler, Principal Component Analysis). Supported models: multiple classifiers and regressors as well as combinations. Hyperparameter optimization: Genetic Programming.
  • Auto-WEKA: Automated machine learning toolkit. The budget can be defined by time and memory as well as search space. Supported models: multiple classifiers and regressors as well as combinations. Hyperparameter optimization: Bayesian Optimization.
  • Devol: Automated machine learning toolkit for deep neural network architectures. Expects the data to be prepared and encoded. Sees itself more as support for experienced data scientists. Supported models: neural networks and combinations. Hyperparameter optimization: Genetic Programming.
  • Machine.js/Auto-ML: Automated machine learning kit based on auto-ml. Expects the data to be prepared and encoded. Supported models: various classifiers and regressors as well as neural networks based on the Keras library; supports combinations. Hyperparameter optimization: Genetic Programming / Grid Search.

Most of these tools support only one method for hyperparameter optimization, although several methods exist. Some models do not require hyperparameter optimization, because they can derive optimal hyperparameters from the training data; unfortunately, this is currently not integrated in any of the tools.
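
As a hedged sketch of what using one of these toolkits looks like, the following runs TPOT on a small dataset bundled with scikit-learn; the budget parameters (generations, population size) are illustrative values only.

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from tpot import TPOTClassifier

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # TPOT searches over preprocessors, models and hyperparameters using
    # genetic programming; the budget here is deliberately tiny.
    automl = TPOTClassifier(generations=3, population_size=20, random_state=42, verbosity=2)
    automl.fit(X_train, y_train)
    print("hold-out accuracy:", automl.score(X_test, y_test))

    # Export the winning pipeline as plain scikit-learn code for review.
    automl.export("best_pipeline.py")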

Those tools might not always be very end-user friendly and you still need to deploy them in the whole organization as fat clients instead of light-weight HTML5 browser applications. As an alternative, popular cloud providers are integrating more assistants into the cloud that help you with your machine learning process.

AutoML on Big Data Platforms

The aforementioned tools have not primarily been designed for Big Data platforms. They are usually based on Python or Java, so one could use them with the Python or Java bindings of those platforms (cf. Flink or Spark). One could use the available data sources (and efficient data formats such as ORC/Parquet) and the sampling functionality of those platforms (e.g. sampling in Flink or sampling in Spark) and feed them into the tools, which could even run on the cluster. However, they would only use one node and the rest of the cluster would not be utilized. Furthermore, the generated models are not necessarily compatible with the models provided by the Big Data platforms. Hence, one has to write a manual wrapper around those models so that they can be distributed over the cluster. Nevertheless, these models would still only use one node.

This is not necessarily bad, because usually data scientists will not evaluate one dataset with AutoML but multiple datasets, so you can utilize the whole cluster by running several AutoML processes. However, it also means data size as well as the budget for time and memory is limited to one node, which might not be sufficient for all machine learning problems. Another interesting aspect could be to run one or more winning models over a much larger dataset to evaluate it in more detail or to optimize it even more. This would again require a deeper integration of the AutoML tools and Big Data platforms.

H2O AutoML is a recent (March 2018) example of how to provide AutoML on Big Data platforms, but it currently has similar limitations as described before with respect to the Big Data platform. The only advantage currently is that the models that are generated are compatible with the machine learning APIs of the Big Data platforms.

Apache Spark has some methods for hyperparameter tuning, but they are limited to a pipeline or model and do not cover aspects of other AutoML solutions, such as comparing different models. Furthermore, it only evaluates a given list of parameter sets, and no time or cost budget definition is possible; this would have to be implemented by the user.
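
For comparison, a hedged sketch of Spark's built-in hyperparameter tuning: the user enumerates a parameter grid explicitly and there is no notion of a time or cost budget. The toy data and grid values are made up for illustration.

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    spark = SparkSession.builder.appName("spark-tuning-demo").getOrCreate()
    df = spark.createDataFrame(
        [(0.0, 1.0, 0.0), (0.1, 0.9, 0.0), (0.2, 0.8, 0.0), (0.3, 0.7, 0.0),
         (0.9, 0.1, 1.0), (0.8, 0.2, 1.0), (0.7, 0.3, 1.0), (0.6, 0.4, 1.0)],
        ["f1", "f2", "label"],
    )

    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    pipeline = Pipeline(stages=[assembler, lr])

    # The grid is fixed up front; every combination is evaluated, budget or not.
    grid = (ParamGridBuilder()
            .addGrid(lr.regParam, [0.01, 0.1])
            .addGrid(lr.maxIter, [10, 50])
            .build())

    cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(labelCol="label"),
                        numFolds=2)
    best_model = cv.fit(df).bestModel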

One can see that AutoML and Big Data platforms could benefit from a tighter integration in the future. It would then also be easier to leverage all the data in your data lake without extracting it and processing it locally. At the moment, although some machine learning models are supported by Big Data platforms (with some caveats related to distributed processing), not all functionality of AutoML is supported (e.g. hyperparameter tuning or genetic programming).

Google AutoML provides an automated machine learning solution in the cloud. It augments it with pre-trained models (e.g. on image data), so not everyone has to spend time training the same models again.

Conclusion

AutoML is a promising tool that helps both less experienced and highly skilled machine learning practitioners spend more time on their actual work instead of the manual, error-prone and labour-intensive parts of the machine learning process. You can on-board your organisation on machine learning even if you are at an early stage and facilitate learning on this topic for many people.

However, this does not mean that people without knowledge can use it out of the box. They should have at least a basic understanding of how machine learning works in general. Furthermore, they should be domain experts in your organization's domain, e.g. a financial analyst in a bank or an engineer in the car industry. They still need to decide which data to feed into AutoML and whether the model learned by AutoML is useful for your business purposes. For example, it does not make sense to put all the data of the world related to problematic situations for autonomous cars into an AutoML run and expect that it can solve all problematic situations as well as possible. Moreover, such data is likely to be country/region-specific, so it may make more sense to have several AutoML runs for different countries/regions to develop region-specific models. Other datasets, such as blockchains/cryptoledgers, are more complex and currently require manual preprocessing and selection.

Another example is known from spurious correlations, i.e. correlations that exist but do not imply causality. In all these cases you still need a domain expert who can judge whether the model is a useful contribution for the organization.

All these things are related to the no free lunch theorem.

Even for highly skilled machine learning practitioners AutoML can make sense. No one can know all the particularities of machine learning models, and it simply takes a lot of time to evaluate them all manually. Some may even have favorite models, which may mean other models are not evaluated although they might fit better. Or they simply do not have time for manual evaluation, so a preselection of candidate models can also be very useful.

One open issue is still how much time and resources you should let AutoML spend on a specific dataset. This is not easy to answer; you may still need to experiment, and if the results are bad you probably need to spend more.

Nevertheless, AutoML as a recent field still has a lot of room for improvement, such as full integration into Big Data machine learning platforms, support for more hyperparameter tuning algorithms, and more user-friendly interfaces, as pioneered by some cloud services.

Then, those models should become part of a continuous delivery pipeline. This requires unit testing and integration testing to avoid that the model has obvious errors (e.g. always returns 0) or that it does not fit into the environment in which it is used for prediction (e.g. the web service cannot be called or the model cannot be opened in R). Integrating machine learning models into continuous delivery pipelines for productive use has not drawn much attention recently, because data scientists usually push them directly into the production environment, with all the drawbacks this approach may have, such as missing unit and integration testing.

Blockchain Consensus Algorithms – Proof of Anything?

Blockchains have proven over the last years to be stable distributed ledger technologies. Stable refers to the fact that they can recover from attacks and/or bugs without compromising their assets. They are most commonly known for enabling transactions with virtual cryptocurrencies not issued by a central authority. Popular examples are Bitcoin and Ethereum. However, both have also been forked to create alternative coins (altcoins) with different features. For instance, Namecoin, the first fork of Bitcoin, is a cryptocurrency that at the same time provides distributed domain names and identities without relying on a central authority. Thus, it is more resilient to censorship or to potentially not democratically elected central authorities governing it.

Of course, at the same time this new technology – as with all new technologies – carries some risk, because it needs to be properly understood by its users. Initial Coin Offerings (ICOs) that fork from popular blockchains (or do not even do this) may be part of frauds or scams.

Hence, it is important to understand their key mechanisms, and this blog post describes one of them: establishing consensus on the transactions happening on a blockchain. Without secure consensus it is possible to steal value (coins) or manipulate entries in the blockchain. The consensus is usually decided by a participant that can provide a proof of something. This something differs between the consensus mechanisms.

About Blockchain Consensus

One should not confuse blockchain consensus with consensus in distributed systems. Consensus in distributed systems is about agreeing on a certain data value during computation. The idea is to reach a common state among several copies of data despite failures, network partitioning or even manipulation by replicas. Paxos and Raft are typical algorithms/protocols to reach consensus; they require electing a leader, which can be any node involved in the consensus and which stays leader as long as it does not fail.

Blockchain consensus, in contrast, is an economic consensus: the economic participants (which are in the end humans) have a common economic interest in reaching it:

  • Preserve the value of their cryptocurrency assets. This especially means that double-spending must not be allowed and can be detected.

  • Leader election: The entity deciding consensus on a given block in the blockchain gets superior benefits (besides value preservation). This means the leader should change very frequently and a transparent mechanism for changing the leader should exist.

  • All participants of the economic consensus can verify that the leader made a correct decision (i.e. produced a valid block). This is also needed so that the leader cannot modify transactions, e.g. their value or destination.

  • Timely communication with all participants before the leader decides on a given block or for electing a leader is seen as infeasible.

  • Produce a block with high probability within a given timeframe (e.g. within 10 minutes) to avoid that participants leave the blockchain because they cannot transfer their cryptocurrency assets at any given point in time.

Why would they have such an interest? They all have assets (financial/non-financial) that they want to preserve. Hence, most participants have no incentive to cheat or to fail to agree on a state of the ledger, because if cheating is detectable or preventable (and it is with an adequate blockchain consensus algorithm) then the value for all participants diminishes. Of course, outsiders who have no stake in a blockchain may want to attack and destroy it – this would be comparable to burning physical money of another currency.

Leader election is different here – the participants do not necessarily trust each other and especially do not want one of them to stay leader for a long time, because the leader benefits heavily in economic terms, for example through the fees that are charged, and the leader can decide to include only certain transactions, e.g. those of a minority of peers (“friends”). This would ultimately lead to a situation where they cannot preserve the value of their assets.

Additionally, it can happen in blockchains that participants do not agree on a common data value during computation, which leads to a fork of the blockchain. Such a fork is temporary or permanent depending on what the participants want. For instance, some participants may want to – and this has happened in the past – create a separate blockchain out of an originally common blockchain with other participants (e.g. the dispute between Bitcoin Core and Bitcoin Classic or the split between Ethereum and Ethereum Classic). Then, consensus might be based on different rules. Such a situation has never been considered in distributed consensus.

As you can see, blockchain consensus and consensus in distributed systems are very different. Economic, governance and human factors distinguish blockchain consensus from consensus in distributed systems. That is why successful blockchain consensus algorithms are fundamentally different from consensus algorithms in distributed systems. One famous example is the proof-of-work algorithm used in the Bitcoin blockchain. Additionally, applying consensus algorithms from distributed systems has not been very popular, although some blockchains try to employ them for closed networks with selected participants only, which makes using blockchain technology rather meaningless.

Algorithms for Proof of Anything

Proof of Work

Bitcoin, as the first practically successful cryptocurrency, introduced Proof-of-Work (PoW). However, it was not the first to propose it, as the Bitcoin paper itself states: Hashcash and others had proposed a similar approach, mainly to address the issue of junk mail. The idea there was that someone has to prove that some investment has been made before sending an email. This would make sending junk mail unprofitable.

Basically the proof-of-work demonstrates that a participant has done some work and gets a reward.

For example, in a simplified version of the proof-of-work in Bitcoin, a block header including relevant parts of the transactions is hashed together with a nonce, and the nonce is varied until the resulting hash is below a certain target value determined by the difficulty.
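A toy Python sketch of this idea follows. It deliberately ignores the real Bitcoin block format and difficulty rules; the header bytes and target are placeholders, and it only illustrates that finding a valid nonce is hard while verifying one is a single hash.

```python
# Simplified, illustrative proof-of-work: vary a nonce until the SHA-256 hash
# of the (placeholder) block header is below a target value.
import hashlib

def pow_hash(header: bytes, nonce: int) -> int:
    return int.from_bytes(hashlib.sha256(header + nonce.to_bytes(8, "big")).digest(), "big")

def mine(header: bytes, target: int) -> int:
    """Find a nonce so that the hash is below the target (hard)."""
    nonce = 0
    while pow_hash(header, nonce) >= target:
        nonce += 1
    return nonce

def verify(header: bytes, nonce: int, target: int) -> bool:
    """Checking a proposed nonce requires only a single hash (fast)."""
    return pow_hash(header, nonce) < target

header = b"previous-hash|merkle-root|timestamp"  # placeholder block header
target = 2 ** 240                                # lower target = higher difficulty
nonce = mine(header, target)
assert verify(header, nonce, target)
```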

PoW has the following characteristics:

  • It must be predictably hard to obtain. For example, Bitcoin has the rule (which of course could be changed if a majority voted for it) that on average every 10 minutes a new proof-of-work (i.e. a block) can be generated.

  • It must adapt to innovation. For instance, new, more powerful hardware may make a proof-of-work obsolete if it does not become more difficult. If a PoW is generated too fast, then the network can be subject to double-spending attacks or one participant might gain a monopoly as leader of the network. Additionally, it must be able to withstand new technology, such as quantum computers or ASICs.

  • It must adapt to network power. For example, if the difficulty grows too much and it becomes too hard to solve, then cryptocurrency assets lose value, because it will take too long to transfer ownership. Hence, the difficulty of the PoW must be able to grow and shrink according to the network's capacity to solve it. This is the case for all cryptocurrencies including Bitcoin.

  • It should be equally difficult for anyone to generate it, i.e. there should be no centralization towards a few entities that are able to generate a PoW. This is not exactly a black-and-white issue, but rather a grey area, because even in Bitcoin this is currently not fully ensured due to the appearance of ASICs.

  • It must be extremely fast for any other participant to verify that the work has been done.

  • It must not be possible to give completely new nodes a long fake chain to split the network and make it attackable. In fact, Bitcoin contains checkpoints that are hardcoded as consensus rules, i.e. certain block hashes at given points in time are declared valid, so new nodes can start validating from a later stage and detect early whether a fake blockchain has been supplied to them. Since Bitcoin is open source, this is a somewhat transparent mechanism to which all participants have to agree.

Although most PoW systems are CPU-bound, the characteristics do not prevent the work from being bound by something else, such as memory. Theoretically, one could also imagine other PoWs, for example based on information entropy, colliders, the speed of light or quantum-computing-specific aspects (example). However, such a PoW system must be available to all participants and satisfy the characteristics above. Nowadays most PoW systems solve hash-based problems (e.g. SHA2-256, SHA3, Scrypt or mixtures of different hash algorithms).

One main point of criticism of PoW is that a lot of energy is “consumed”. I purposely do not write wasted, because the PoW ensures the functioning of the cryptocurrency and, as we will see later, no viable alternatives currently exist for public blockchains. Additionally, one should keep in mind that payment processing, clearing, physical money, credit cards, server energy etc. also have an energy footprint, but there has – at least to my knowledge – never been a study comparing whether PoW is more energy hungry (first attempts exist, cf. here).

However, there have been attempts to improve the PoW. For example, some cryptocurrencies have as PoW a more or less meaningful problem (Proofs of Useful Work – PoUW), e.g. Primecoin searches for chains of prime numbers. Meaningful usually implies a mathematical problem that is simple to describe but fulfills some or, even better, all of the characteristics above. It needs to be simple to describe, because if it is complex to describe then it is complex to understand, difficult to test and prone to errors. Permacoin uses PoW for distributed storage of archival data, i.e. one needs to provide storage to solve the PoW. Gridcoin attempts to solve scientific problems.

Nevertheless, they may not be able to fulfill all the characteristics mentioned above, which explains their limited popularity for cryptocurrencies. However, there has never been – to my knowledge – a complete study and comparison of all these mechanisms including quality (testing!), ecological, economic and socio-economic effects.

Others try to reuse an existing PoW. In fact, in some sense the PoW is reused within Bitcoin, because once a transaction is included in a block its output can be reused in other transactions. Other approaches, such as merged mining, allow mining for different blockchains at the same time using the same work (e.g. Bitcoin and Namecoin).

Finally, another criticism of PoW is that it is slow. Usually the Bitcoin delay of 10 minutes on average for generating a block is cited. However, these 10 minutes are a deliberate decision by the originators of Bitcoin and not a technological limit. In fact, at any time this could be changed by a majority to be more or less. However, having less time might have significant security and economic impacts, which need to be carefully weighed. Furthermore, with the existence of second-layer networks, such as the Lightning Network, this limit can probably be avoided more elegantly and allow scaling of payment processing similar to popular payment networks.

Proof of Stake

Proof of Stake (PoS) is another way to establish economic consensus.

PoS is basically about voting on the next block in a blockchain based on the economic stake in the network. A stake could for instance be determined by a number of cryptocurrency assets in a locked deposit or by the stake of CPU/memory/energy/etc. in the network. Variants of it include differences between who can propose a new block (consensus) and who can vote on it. The idea is that someone who has a lot of stake will not do anything to endanger this stake, such as cheating, because the stake would then become less valuable.
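A minimal sketch of stake-weighted selection of a block proposer is shown below. Real PoS protocols add locked deposits, slashing and verifiable randomness; the stakes and seed here are illustrative assumptions and only show that the chance of being selected is proportional to the stake.

```python
# Sketch: stake-weighted (pseudo-)random selection of a block proposer.
import random

stakes = {"alice": 50, "bob": 30, "carol": 20}  # hypothetical locked deposits

def select_proposer(stakes: dict, seed: int) -> str:
    rng = random.Random(seed)  # in practice the seed must be unbiased and verifiable
    participants = list(stakes.keys())
    weights = [stakes[p] for p in participants]
    return rng.choices(participants, weights=weights, k=1)[0]

# e.g. derive the seed from the previous block so all nodes select the same proposer
print(select_proposer(stakes, seed=123456))
```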

However, PoS has not been as successful as Proof-of-Work. Currently, none of the large cryptocurrencies uses it. Nevertheless, for Ethereum it was initially assumed that PoS would be used instead of PoW (“Slasher”), but currently Ethereum only supports PoW (Ethash). The reason, according to the originators, was that proof-of-stake is non-trivial.

The characteristics of PoS are the following:

  • Votes on a new block are cast according to the economic stake of a participant in the blockchain. However, centralization towards the “richest” participant should be avoided. This is usually done by differentiating between block proposers (selected randomly or according to another rule) and block validators (that have a stake).
  • Economic stake may change and is not fixed.
  • It should not be possible to revote once a vote has been cast and has existed in the network for some time, but not too long. It should not be possible to vote on several alternative chains of the same blockchain. This implies that the economic stake must be at risk in case of abuse (“nothing at stake” problem).
  • Nodes need to be online, ie connected to other nodes, to vote with a relevant stake.
  • The vote needs to terminate with an outcome (yes/no) after a certain short amount of time.

The main difference, it seems, is that the reward for work is replaced by a vote based on stake. In some sense, PoS can be compared to shareholder votes.

One interesting question is how such stakes can be distributed initially. Some cryptocurrencies sell stakes in their initial offering for fiat currencies or already working (“bootstrapped”) cryptocurrencies, such as Bitcoin. This has recently led to a number of fraudulent initial coin offerings (ICOs). The reason is that it is virtually impossible for participants to find out whether a cryptocurrency will be successfully adopted or not (or whether it even exists), which implies a very high risk. Even afterwards, wrong decisions can render a cryptocurrency valueless.

Several theoretical ideas have been proposed for PoS, but they rarely end up in public blockchains because of the inherent, non-trivial issues. It is significantly more complex to implement than PoW in the case of decentralized public blockchains. It potentially involves several roles (e.g. proposer, voter and validator) that need to communicate actively (in PoW communication is passive). Furthermore, it can be (but need not be) less transparent than PoW, because the stakes and their development over time might be difficult to monitor (here dedicated analytics software may help). Examples are the previously mentioned Slasher protocol, the new protocol proposed by Ethereum (Casper) or the minting by Peercoin (basically based on coin age).

Practical examples of PoS exist, such as Peercoin, but one disadvantage is that a single entity (of unknown identity) has the ability to invalidate any chain at any point in time from anywhere. The reason for this checkpointing mechanism was the nothing-at-stake problem. However, this mechanism will meanwhile be changed for Peercoin.

These practical examples are nonetheless not as successful currently as cryptocurrencies based on PoW.

However, PoS also has some advantages, such as potentially lower energy consumption or the possibility to set up more sophisticated governance mechanisms (including everyone).

Proof of Burn

Proof of Burn (PoB) is currently only a theoretical concept that has appeared on the Bitcoin mailing lists as an alternative to PoW. It should be seen as work in progress, because it has not yet been formally written down and analyzed.

This should not be confused with burning coins of one cryptocurrency to create coins of another cryptocurrency. This would be a more complex scenario related to PoB.

PoB works as follows: someone sends an amount of cryptocurrency (e.g. Bitcoins) to a destination from which it cannot be spent anymore, i.e. it becomes provably unspendable (hence the analogy of burning it). After a certain amount of time (e.g. two months) the participant can propose a new block and present the burned amount of cryptocurrency as proof.
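Since PoB has no standardized specification, the following is only a rough, hypothetical sketch of the idea: burns are recorded, and only after a maturity period may the burning participant reference the burn when proposing a block. All names and the maturity period are illustrative assumptions.

```python
# Hypothetical sketch of proof-of-burn bookkeeping (not a real protocol).
MATURITY_BLOCKS = 8640  # e.g. roughly two months at ~10 minutes per block (assumption)

burns = {}  # txid -> (participant, amount, burn_height)

def register_burn(txid: str, participant: str, amount: float, height: int) -> None:
    """Record a transaction whose outputs are provably unspendable."""
    burns[txid] = (participant, amount, height)

def may_propose(txid: str, participant: str, current_height: int) -> bool:
    """A participant may propose a block only with a matured burn of their own."""
    if txid not in burns:
        return False
    owner, amount, burn_height = burns[txid]
    matured = current_height - burn_height >= MATURITY_BLOCKS
    return owner == participant and amount > 0 and matured
```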

Some might ask why someone would do such a thing and spend money just to propose a new block. Remember what I said in the Proof-of-Work section: the proposer of a block gets superior benefits, such as transaction fees. Obviously, for this to work the burned amount of cryptocurrency must be lower than the transaction fees.

The proposal might not be as senseless as it looks, because its supporters argue that even for PoW some money needs to be burned by buying equipment to do the proof of work.

Furthermore, it requires that a certain amount of cryptocurrency is already there (e.g. generated via PoW).

PoB also has further implications that are not yet well understood. Very few preliminary implementations exist, such as Slimcoin (based on Peercoin), and they should be treated with care.

Proof of Elapsed Time

Proof of Elapsed Time (PoET) attempts to address the problem in proof-of-stake that a random election of the participants proposing blocks is needed to ensure that every participant has a fair chance to propose a block and thus to generate superior benefits.

The idea is the following: every participant requests a wait time from its local trusted enclave. The participant with the shortest wait time is next to propose a block, after it has waited for the assigned waiting time.

Each local trusted enclave signs the function and the outcome so that other participants can verify that no one has cheated on the wait time.
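A toy simulation of this leader election is sketched below. The exponential wait-time distribution, mean wait and node names are assumptions for illustration only, and the enclave signing/attestation step is merely indicated by a comment.

```python
# Toy PoET-style leader election: the shortest randomly drawn wait time wins.
import random

def request_wait_time(rng: random.Random, mean_wait: float = 30.0) -> float:
    # In the real scheme this value is produced and signed inside the secure
    # enclave so that other participants can verify it was not manipulated.
    return rng.expovariate(1.0 / mean_wait)

participants = ["node-a", "node-b", "node-c"]
rng = random.Random(42)
wait_times = {p: request_wait_time(rng) for p in participants}

winner = min(wait_times, key=wait_times.get)
print(f"{winner} proposes the next block after waiting {wait_times[winner]:.1f}s")
```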

It seems, and it has been claimed by the people proposing PoET, that it fulfills the characteristics of PoS described above.

The concept was first proposed by Intel as part of its Hyperledger Sawtooth open source technology, which is no surprise, because it is another use case for its SGX technology providing the enclave.

Although the approach does not prevent mixing in or using other secure enclaves besides the Intel one, this has – to my knowledge – not yet been proposed (e.g. based on AMD Secure Memory Encryption (SME)/Secure Encrypted Virtualization (SEV)).

There are some things that you need to be aware of when using this approach:

  • The secure enclave is a rather complex technology, which potentially makes breaking it easier than cheating in PoW.

  • In order for participants to verify that a secure enclave has provided the value, they rely on a trusted third-party certification authority or a web of trust that signs the keys of the secure enclave. Hence, there is a clear tendency towards centralization, which is avoided in PoW or PoS scenarios.

Practical Byzantine Fault Tolerance

Practical Byzantine Fault Tolerance (PBFT) is a consensus algorithm that is normally used for consensus in distributed systems and, as argued before, does not really fulfill the requirements for economic consensus in blockchains.

Since PBFT becomes infeasible in networks with a large number of nodes due to the required communication (a rough sketch of the communication cost follows the list below), blockchain technologies using PBFT rely only on a trusted subnetwork of participants to establish consensus (e.g. a unique node list for each participant in Ripple). This poses some problems:

  • How large should this list be and how should a “normal” participant know who to include in its trusted network?
  • How can a participant detect forks of the blockchain (e.g. servers changing their trusted subnetwork)?
  • What is the incentive for a participant to participate in a consensus? There is no transaction fee per se foreseen in the consensus.
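As a back-of-the-envelope sketch of why PBFT does not scale to open networks: it tolerates f Byzantine nodes out of n >= 3f + 1, and its prepare/commit phases require all-to-all communication, i.e. on the order of n² messages per decision. The small script below only illustrates this arithmetic.

```python
# Rough PBFT scaling arithmetic (illustrative only).
def pbft_fault_tolerance(n: int) -> int:
    """Maximum number of Byzantine nodes a PBFT network of size n can tolerate."""
    return (n - 1) // 3

def pbft_messages_per_phase(n: int) -> int:
    """All-to-all messages in one prepare or commit phase (order of magnitude)."""
    return n * (n - 1)

for n in (4, 10, 100, 1000):
    print(n, pbft_fault_tolerance(n), pbft_messages_per_phase(n))
```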

PBFT is advocated by a few blockchain technologies, such as Ripple (as described here) or Stellar. There, however, the use case is also slightly different: it is more about connecting large banking networks and not about open participation as in other blockchain technologies. Hence, most of the questions stated before may have a clear, convincing answer. Additionally, transaction fees are introduced by burning a certain amount of currency in each transaction – none of the participants has access to the burned amount. This is used to avoid that the blockchain is flooded with a large number of transactions to render it useless or to gain economic benefits from it. Hence, for these kinds of special blockchains with a specialized set of participants this mechanism can still make sense.

Conclusion

Blockchains have proven to be mature technologies, as public examples such as Bitcoin or Ethereum demonstrate. However, from an economic perspective not all mechanisms are well understood, especially due to the huge variety of concepts and their rapid development. This has also led to frauds involving fake cryptocurrencies and blockchains as part of certain initial coin offerings (ICOs). Furthermore, the different types of participants in different types of blockchains make it even more difficult to understand the context.

Although it seems that PoW is dominating now, it is more suitable for public blockchains independent of any central entity. This might not always be desired, because a central entity can ensure with the right policies that every participant has access to the blockchain, is protected from other participants and has the same rights as well as responsibilities, similar to how it already is with fiat money. Hence, PoS systems may gain more traction because they have a more flexible governance model than PoW. They could evolve into a system of proof of mastery, e.g. a certain subset of participants in a blockchain proposes new blocks because they have been delegated this task by all the other participants. This subset of participants would use open source software and analytics on the blockchain and provide transparent mechanisms as well as information to all the other participants that delegated their stake to them.

However, due to the inherent challenges of PoS, combined systems of PoW/PoS/PoB may ultimately be the successful ones. There seems to be a tendency towards this (e.g. Casper for the Ethereum blockchain). Given the different approaches, a lot of different combinations are possible. For instance, one can have PoS for the “daily” business of creating blocks, but PoW for checkpointing the blockchain at certain points in time.

Nevertheless, all these systems can only be successful and transparent if powerful analytics software is available to any participant, so they can track the effectiveness of decisions within certain blockchain technologies and derive appropriate consequences out of it.

Keep in mind that not only the proof of anything (PoW, PoS, PoB etc.) is a challenge here, but also other powerful groups, such as the developers who write the blockchain technology – they also tend towards centralization in a single group and they have a lot of power.

In the future, we will see more cross-blockchain activities addressing how cryptocurrencies from one chain can end up in another chain (e.g. via PoB). Similar to today's exchanges for fiat money, there will always be a need for exchanging different cryptocurrencies. There will not be one cryptocurrency, but always many, due to the different interests of participants or the adoption of new technologies.

Furthermore, we will need to deal with automated non-human participants in the blockchain. Robots or “things” may have a certain amount of cryptocurrency to perform tasks, e.g. an autonomous car that needs to pay highway tolls (assuming there is no reason anymore for an autonomous car to be “owned”; it exists on its own and makes money by bringing passengers from A to B). These types of participants may have different incentives and requirements for economic consensus.

Spending Time on Quality in Your Big Data Open Source Projects

Open source libraries are nowadays a critical part of the economy. They are used in commercial and non-commercial applications, directly or indirectly affecting virtually every human being. Ensuring quality should be at the heart of each open source project. Verifying that an open source project ensures quality is mandatory for each stakeholder of this project, especially its users. I describe in this post how software quality can be measured and ensured when integrating such libraries into your application. We focus only on evaluating the technical quality of a library, because its suitability for business purposes has to be evaluated within the regular quality assurance process of your business.

The motivation for this post is that other open source projects may also employ transparent software quality measures so that eventually all software becomes better. Ideally, this information can be pulled by application developers to get a quality overview of their software and its dependencies. The key message is that you need to deploy solutions to efficiently measure the software quality progress of your applications including their dependencies and make the information transparently available in a dashboard (see screenshot) for everyone in near real time.

softwarequalitydashboard

Although I write here about open source software, you should push your commercial software providers to at least provide the same level of quality assurance and in particular transparency. This also includes the external commercial and open source libraries they depend on.

Software quality is “the totality of functionality and features of a software product that bears on its ability to satisfy stated or implied needs” (cf. ISTQB Glossary). This definition is based on the ISO/IEC 9126 standard (succeeded by ISO/IEC 25000), which describes various quality classes, such as functionality, reliability, usability, efficiency, maintainability and portability.

I chose two open source libraries in which I am deeply involved to demonstrate measures to increase their quality:

  • HadoopCryptoLedger: A library for processing cryptoledgers, such as the Bitcoin Blockchain, on Big Data platforms, such as Hadoop, Hive and Spark

  • HadoopOffice: A library for processing office documents, such as Excel, on Big Data platforms, such as Hadoop and Spark

Before you think about software quality, one important task is to think about which functionality and which business processes supported by the application bring the maximum value to the customer. Depending on the customer's needs, the same application can bring different value to different customers. When looking at the supported business process you must also consider that not all potential execution variants of it should be evaluated, but only the most valuable ones for the business.

At the same time, you should evaluate the risk that the application fails and how this impacts the business process. If the risk is low or you have adequate mitigation measures (e.g. insurance), then you can put less emphasis there compared to a business process where the risk is high and no mitigation measures exist. Additionally, I also recommend looking at the value of functionality. If a functionality provides a lot of value, then you should test it much more. Sometimes this change of view from risk to value uncovers more important areas of quality of a software product.

Steps for measuring quality

One of the core requirements for increasing software quality is to measure the quality of third-party libraries. It implies that you have mapped and automated your software development and deployment process (this is also sometimes referred to as DevOps). This can be done in a few steps:

  1. Have a defined, repeatable build process and execute it in an automated fashion via a continuous integration tool.

  2. Version your build tool

  3. Software repository not only for source code

  4. Integrate automated unit and integration tests into the build process.

  5. Download dependencies, such as third party libraries, from an artifact management server (repository) that also manages their versions

  6. Evaluate quality of test cases

  7. Code documentation

  8. Code Analytics: Integrate static code analysis into the build process.

  9. Security and license analysis of third party libraries

  10. Store built and tested libraries in a versioned manner in an artifact management server (repository)

  11. Define a Security policy (security.md)

  12. Define a contribution policy (contributing.md)

  13. Ensure proper amount of logging

  14. Transparency about the project activity

Of course, these are only technical means for measuring quality. Out of scope here, but equally important, is to define and version requirements/use cases/business process models, system tests, acceptance tests, performance tests, reliability tests and other non-functional tests. Additionally, you should take into account automated deployment and deployment testing, which is out of scope of this article.

Evaluate quality

Given the aforementioned points, we now describe how they have been realized for the HadoopOffice and HadoopCryptoLedger libraries. Of course, they can also be realized with different tools or methods, but they provide you with some indicators of what to look for.

1. Have a defined repeatable automated continuous integration process

We use the continuous integration platform Travis CI for the HadoopOffice and HadoopCryptoLedger libraries; it is offered as a free service for open source projects in the cloud. The aforementioned projects include in the source code repository root folder a file named .travis.yml (example), which describes the build process that is executed by Travis. Every time new code is committed, the library is built again, including the examples, and automated unit and integration tests are executed.

Travis CI provides a container infrastructure that allows executing the build process in so-called software containers, which can flexibly integrate different types of software without the overhead of virtual machines. Containers can be pulled from repositories and flexibly composed together (e.g. different Java Development Kits, operating systems etc.). Additionally, one avoids the cost of virtual machines being idle and capacity not being used. Furthermore, you can track previous executions of the build process (see example here) to measure its stability.

Gradle is used for nearly every step in the process: to build the application, generate documentation, execute unit and integration tests and collect reports.

Shell scripts are used to upload the HTML reports and documentation to a branch in Github (gh-pages) so that it can be easily found in search engines by the users and viewed in the browser.

2. Version your build tool

As said, our main build tool is Gradle. Gradle is based on Groovy, the popular scripting language for the Java Virtual Machine, and thus it scales flexibly with your build process and makes build scripts more readable in comparison to XML-based tools, which are inflexible and a nightmare to manage for larger projects. Furthermore, plugins can be easily implemented using Groovy.

Modern build tools are continuously evolving. Hence, one wants to test new versions of the build tool or upgrade to new versions rather smoothly. Gradle allows this using the Gradle Wrapper – you simply define the desired version and it downloads it (if not already done) to be executed for your current build process. You do not need any manual, error-prone installation.

3. Software Version Repository not only for source code

One key ingredient of the software development process is the software version repository. Most of them are now based on Git. It is extremely flexible and scales from single-person projects to projects with thousands of developers, such as the Linux kernel. Furthermore, it allows you to define your own source code governance process. For example, the source code of the open source libraries that I introduced in the beginning is hosted on a Git server in the cloud: Github. Alternatives are Gitlab or Bitbucket.org, but there are many more. These cloud solutions also allow you to have private repositories, fully backed up, for a small fee.

As I said, these repositories are not only for your software source code, but also for your build scripts and continuous integration scripts. For example, for the aforementioned software libraries they store the Travis scripts and the build files as well as test data. Large test data files (e.g. full blockchains) should not be stored in the version management system, but in dedicated cloud object storage, such as Amazon S3 or Azure Storage. There you also have the possibility to store them on ultra-low-cost storage, which might take longer to access, but this can be acceptable for a daily CI process execution.

4. Integrate automated unit and integration tests into the build process.

Another important aspect of software delivery is the software quality management process. As a minimum you should not only have automated unit and integration tests, but also daily reporting on the coverage.

The reporting for the aforementioned libraries is done via Sonarqube.com and Codacy. Both report not only results of static code analysis but also unit/integration test coverage. Two have been chosen because Codacy supports Scala (a programming language used in parts of the aforementioned libraries).

More interestingly, on both platforms you also find results for popular open source libraries. Here you can check, for the open source libraries that you use in your product, how good they are and how they improve quality over time.

Of course there are many more cloud platforms for code coverage and quality analysis (e.g. SourceClear).

Another important testing method beyond unit testing for developers is integration testing. The reason is that complex platforms for Big Data, such as Hadoop, Flink or Spark, have emerged over the past years. These tests are necessary because unit tests cannot properly reflect the complexity of these platforms and system tests do not focus on technical aspects, such as code coverage. Furthermore, the execution of integration tests is very close to the final execution on the platforms, but requires fewer resources and virtually no installation effort (an example is Spring Boot for web applications). However, unit tests are still mandatory, because they are very suitable for quickly and automatically verifying that calculations with many different inputs are done correctly.

Nevertheless, these integration tests require additional libraries. For example, for the aforementioned libraries, which support all major Big Data platforms, the Hadoop integration test suites are used as a basis. They provide small mini-clusters that can even be executed on our container-based CI infrastructure with very little configuration and no installation effort. On top of this, we of course also use the testing features of the platforms themselves, such as the ones for Spark or Flink.
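The libraries themselves use Java-based Hadoop mini-clusters; as an analogous sketch in Python, a local-mode SparkSession can serve as a lightweight embedded "cluster" fixture for integration tests, so nothing has to be installed on the CI container beyond the library dependencies.

```python
# Sketch: pytest fixture providing an embedded local Spark "cluster" for tests.
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    session = (SparkSession.builder
               .master("local[2]")          # embedded local mode with two threads
               .appName("integration-tests")
               .getOrCreate())
    yield session
    session.stop()

def test_row_count(spark):
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    assert df.count() == 2
```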

5. Download dependencies, such as third party libraries, from an artifact management server (repository) that manage also their versions

It should be a no-brainer, but you should manage all of the third-party libraries that you include in your application via an artifact management server. Popular choices are Nexus or Artifactory. Both also offer cloud solutions for open source and private libraries (e.g. Maven Central or Bintray). The public cloud solutions should be used with care: see, for example, the npm gate security affair for Javascript packages or the PyPI issue for Python packages, where attackers replaced popular packages or uploaded packages with names similar to popular packages to trick developers into using malicious libraries in their application code. Having a local artifact management server quickly indicates which of your developed applications are exposed, and you can initiate mitigation measures (e.g. directly communicate to the project which library is affected).

These repositories are important because you do not need to manage access to third-party libraries in each software project. You can simply define the third-party library and its version in your build tool and it will be downloaded during the build process. This makes it simple to change and test new versions.

Furthermore, having an artifact management server allows you to restrict access to third-party libraries, e.g. ones with known security issues or problematic licenses. This means you can easily make your organisation safer and you can detect where in your software products libraries with issues are used.

Especially with the recent trend of self-service by end users in the Big Data world it is crucial that you track which libraries they are using to avoid vulnerabilities and poor quality.

6. Evaluate quality of test cases

As I wrote before, the quality of test cases is crucial. One key aspect of quality is coverage, i.e. to which extent your tests cover the source code of your software. Luckily, this can easily be measured with tools such as Jacoco. They give you important information on statement and condition coverage.

There is no silver bullet for how much coverage you need. Some famous software engineers say you should have at least 100% statement coverage, because it does not make sense that someone uses/tests a piece of software if not even the developers themselves have tested it. 100% might be difficult, especially in cases of rapidly changing software.

In the end this is also a risk/value estimation, but achieving high statement coverage is NOT costly and later saves you a lot of costs through fewer bugs and generally better maintainable software. It often also uncovers aspects that no one has thought about before and that require clarification, which should happen as early as possible.

For the aforementioned libraries we made the decision to have at least 80% statement coverage, with the aim of reaching 100% coverage. Condition coverage is another topic. There are some guidelines, such as DO-178B/DO-178C for avionics or ISO 26262 for automotive, that require a certain type of condition coverage for critical software.
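The difference between statement and condition coverage is easy to illustrate with a small, purely hypothetical Python example: the two tests below execute every statement of is_valid (full statement coverage), yet the combination "amount > 0 with a non-EUR currency" is never exercised, so the individual conditions are not fully covered.

```python
# Illustration only: full statement coverage, incomplete condition coverage.
def is_valid(amount: float, currency: str) -> bool:
    if amount > 0 and currency == "EUR":
        return True
    return False

def test_is_valid():
    assert is_valid(10.0, "EUR") is True   # condition: True and True
    assert is_valid(-1.0, "EUR") is False  # first condition False, second short-circuited

test_is_valid()
```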

For the aforementioned libraries, no such guideline or standard could be applied. However, both libraries support Big Data applications, and these applications, depending on their criticality, will need to be tested accordingly.

Finally, another measure of the quality of test cases is mutation testing. This type of evaluation modifies the software under test to see whether the test cases still detect the mutated software (i.e. kill the mutant). This is very useful in case a method changes frequently and is used in many parts of the application, making it an important piece. I found it rather useful for the utility functions of the HadoopCryptoLedger library that parse binary blockchain data to extract meaningful information from it (cf. based on pitest). In this way you can cover more of the software with higher-quality test cases.

7. Code documentation

Documentation of code is important. Of course, ideally the code is self-documenting, but with comments one can navigate the code more easily and grasp the key concepts of the underlying software. Everybody who has more in-depth development exposure to popular Big Data platforms appreciates the documentation (and of course the source code) shipped with them.

Additionally, one can automatically generate a manual (e.g. by using Javadoc or Scaladoc) that is very useful for developers to leverage the functionality of your library (find an example of the HadoopCryptoLedger library here). For the aforementioned libraries the documentation is automatically generated and stored in the version management branch “gh-pages”. From there, Github Pages takes the documentation and makes it available as web pages. Then it can be found by search engines and viewed in the browser by the users.

The latter point is very important. Usually the problem is not that the developers do not write documentation, but that the documentation is not published in time. The documentation for the aforementioned libraries is published with every build in the most recent version. In companies this is usually not the case; it is a manual process that is done infrequently or not at all. However, solutions exist here, for example Confluence and the Docs plugin or the Jenkins Javadoc plugin, among many others.

Furthermore, one should look beyond standard programming language documentation. Nowadays you can integrate UML diagrams (example) and markdown (example) in your source code comments, which means that your technical specification can be updated with every build, i.e. at least once a day!

Finally, tools such as Sonarqube or Openhub can provide you information on the percentage of documentation in comparison to other software.

8. Code Analytics: Integrate static code analysis into the build process.

Besides testing, static code analysis can point out areas of improvement for your code. From my experience, if you start right away with it, then it is a useful tool with very few false positives.

Even if you have a project that has grown over 10 years and did not start with code analysis, it can be a useful tool, because it also shows the quality of newly added or modified code, which you can improve right away. Of course, for code that has never been touched and seems to have worked for many years, it should be carefully evaluated whether it makes sense to adapt it to the recommendations of a static code analysis tool.

Static code analysis is very useful in many areas, such as security or maintainability. It can show you code duplication, which should be avoided as much as possible. However, some code may always have duplications. For example, the aforementioned libraries support two Hadoop APIs that do essentially the same thing: mapred and mapreduce. Nevertheless, some Big Data tools, such as Hive, Flink or Spark, use one API but not the other. Since they are very similar, some duplications exist and have to be maintained in any case. Nevertheless, in most cases duplications should be avoided and removed.

One interesting aspect is that popular static code analysis tools in the cloud, such as Codacy, SourceClear or Sonarqube, also show you the quality, or even better the evolution of quality, of many popular open source projects, so that you can easily evaluate whether you want to include them or not.

9. Security & License analysis of third party libraries

Many open source libraries use other open source libraries. The aforementioned libraries are no exception. Since they provide services to Big Data applications, they need to integrate with the libraries of popular Big Data platforms.

Here it is crucial for developers and security managers to get an overview of the libraries the application depends on, including transitive dependencies (dependencies of dependencies). Ideally, the dependencies and their versions are identified and matched to security databases, such as the NIST National Vulnerability Database (NVD).

For the aforementioned libraries the OWASP dependency check is used. After each build the dependencies are checked and problematic ones are highlighted to the developer. The dependency check is free, but it identifies dependencies and matches them to vulnerabilities based on the library name. This matching is not perfect and requires that you use a build tool with dependency management, such as Gradle, where there are clear names and versions for libraries. Furthermore, the NVD entries are not always of good quality, which makes matching even more challenging.

Commercial tools, such as JFrog Xray or Nexus IQ, have their own databases with enriched security vulnerability information based on various sources (NVD, Github etc.). Matching is done based on hashes and not names. This makes it much more difficult to include a library with wrong version information (or even with manipulated content) in your application.

Additionally, they check the libraries for open source licenses that may have certain legal implications on your organization. Licensing is a critical issue and you should take care that all source code files have a license header (this is the case for the libraries introduced in the beginning).

Finally, they can indicate good versions of a library (e.g. not snapshots, popular by other users, few known security vulnerabilities).

Since the libraries introduced in the beginning are only used by Big Data applications, only the free OWASP dependency check is used. Additionally, the key libraries are of course kept in sync with proper versions. Nevertheless, the libraries themselves do not bundle other libraries; when a Big Data application is developed with these libraries, the developer has to choose the dependent libraries of the Big Data platform.

The development process of Big Data application that use these libraries can employ a more sophisticated process based on their needs or risk exposure.

10. Store built and tested libraries in a versioned manner in an artifact management server (repository)

Since you already retrieve all your libraries in the right version from the artifact management server, you should also make your own libraries or application modules available there.

Our libraries introduced in the beginning are stored on Maven Central in a versioned fashion, where developers can pull them in the right version for their platform.

You can also consider making your Big Data application available on an on-premise or cloud artifact management server (private or public). This is very useful for large projects where an application is developed by several software teams and you want to, for example, test the integration of new versions of an application module and easily switch back if they do not work.

Furthermore, you can use it as part of your continuous deployment process to provide the right versions of the application module for automated deployment up to the production environment.

11. Define a Security policy (security.md)

A security policy (example) is a shared understanding between developers and users on security of the software. It should be stored in the root folder of the source code so that it can be easily found/identified.

As a minimum it defines the process for reporting security issues to the developers. For instance, issues should – in the beginning – not be publicly visible to everyone. The reason is that developers and users need some time to upgrade to the fixed software. After a grace period following the fix, the issue should of course be published (depending on how you define it in your security policy).

Depending on your software you will have different security policies – there is no unique template. However, certain elements can be applied to many different types of software, such as the security reporting process, usage of credentials, encryption/signing algorithms, ports used by the software or anything else that might affect security.

Although a lot of open source software nowadays has such a policy, it is by far not common or easy to find. Furthermore, there is clearly a lack of standardization and awareness of security policies. Finally, what is completely missing is automatically combining the different security policies of the libraries your application depends on to enable a more sophisticated security analysis.

12. Define a contribution policy (contributing.md)

Open source libraries live on voluntary contributions from people all over the world. However, one needs to define a proper process that defines minimum quality expectations for contributions and describes how people can contribute. This is defined in a policy that is stored in the root folder of the source code (example).

This also avoids conflicts if the rules of contribution are clear. Source code repositories, such as Github, show the contribution policy when creating issues or when creating pull requests.

13. Ensure proper amount of logging

Logging is necessary for various purposes. One important purpose is log analytics:

  • Troubleshooting of issues

  • Security analysis

  • Forensics

  • Improving user interactions

  • Proactive proposal for users what to do next based on other users behavior

Finding the right amount of logging is crucial. Too much logging decreases system performance, too little logging does not give the right information. Different configurable log levels support tuning this effort; however, usually the standard log levels (error, warn, info, debug etc.) are used and they do not distinguish between the use cases mentioned before. Even worse, there is usually no standard across an enterprise. Thus, they do not take into account the integration of development and operations. For instance, if there is a warning, then operations should know whether the warning needs to be taken care of immediately or whether there is time to prioritize their efforts.
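A minimal sketch of this idea using Python's standard logging module is shown below; the "action_required" field is an illustrative in-house convention (no such standard exists), and the component name is hypothetical.

```python
# Sketch: structured log field telling operations whether action is needed now.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s action_required=%(action_required)s %(message)s",
)
log = logging.getLogger("payment-service")  # hypothetical component name

log.warning("Retrying connection to datastore",
            extra={"action_required": "no"})
log.warning("Certificate expires in 3 days",
            extra={"action_required": "yes"})
```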

This also implies that there is a need to have standards for logging in open source libraries, which do not exist. As a first step they would need to document their logging approach to get this discussion in the right direction.

Finally, although certain programming languages, such as Java with log4j, have rather sophisticated logging support, this is not necessarily the case for other programming languages, especially high-level languages such as R or Pig. Here you need to define standards and libraries across your enterprise.

14. Transparency about the project activity

This is the last point (in this blog), but also one of the more important ones. Currently, organizations do not collect any information about the open source libraries they are using. This is very risky, because some open source projects may show less and less activity and eventually be abandoned. If an organization bases a highly valuable business process on them, then this can cause a serious issue. In fact, many applications (even commercial ones) nowadays consist to a large extent of open source libraries.

Additionally, you may want to know how much effort is put into an open source library, how frequently it changes, the degree of documentation etc. Of course, all these should be shown in context with other libraries.

OpenHub is an example of a platform that does exactly this (find here an example based on one of the aforementioned libraries). Here you can see the activity and the community of a project based on data it pulls out of the source code repository Github. It provides cost estimates (what it would cost to develop the software yourself) and security information.

Several popular open source projects have been scanned by OpenHub and you can track the activity over years to see if it is valuable to include a library or not.

Conclusion

Most of the aspects presented here are not new. They should have been part of your organization for several years. Here, we put some emphasis on how open source developers can increase quality and its transparency using free cloud services. This may lead to better open source projects. Furthermore, open source projects have always been a driver for better custom development projects in companies by employing best practices in the area of continuous integration and deployment. Hence, the points mentioned here should also serve as a checklist to improve your organization's software development quality.

Software quality increases the transparency of your software product, which is the key to successful software development. This in itself is a significant benefit, because if you do not know the quality then you do not have quality.

There are a lot of tools supporting this, and even if you have to develop unit/integration tests for 100% coverage, this usually does not take more than 5-10% of your developers' time. The benefits, however, are enormous. Not only do the developers and users understand the product better, they are also more motivated to see the coverage and keep a high standard of quality in the product. If your software provider tells you that unit/integration tests are not worth the effort or that they do not have time for them – get rid of your software provider – there is no excuse. They must make quality information available in real time: this is the standard.

Given the increase in automation and documentation of software products, I expect a significant increase in developer productivity. Furthermore, given the convergence of Big Data platforms, automated assistants (bots) will guide users to develop high-quality analyses on their data more easily and quickly themselves. Hence, they will be more empowered and less dependent on external developers – they will become developers themselves, with the guidance of artificial intelligence doing most of the routine work so that they can focus on more valuable activities.

Nevertheless, this depends also on their discipline and the employment of scientific methods in their analysis and execution of business processes.

Lambda, Kappa, Microservice and Enterprise Architecture for Big Data

A few years after the emergence of the Lambda architecture, several new architectures for Big Data have emerged. I will present and illustrate their use case scenarios. These are IT architectures, but towards the end of this blog I will describe the corresponding Enterprise Architecture artefacts, which are sometimes referred to as the Zeta architecture.

Lambda Architecture

I have blogged before about the Lambda-Architecture. Basically this architecture consists of three layers:

  • Batch Layer: This layer executes long-running batch processes to do analyses on larger amounts of historical data. The scope is data ranging from several hours to weeks or even years. Here, usually Hadoop MapReduce, Hive, Pig, Spark or Flink are used together with orchestration tools, such as Oozie or Falcon.

  • Speed Layer/Stream Processing Layer: This layer executes (small/”mini”) batch processes on data within a time window (e.g. 1 minute) to do analyses on larger amounts of current data. The scope is data from several seconds up to several hours. Here one may use, for example, Flink, Spark or Storm.

  • Serving Layer: This layer combines the results from the batch and stream processing layers to enable fast interactive analyses by users. It usually leverages relational databases, but also NoSQL databases, such as graph databases (e.g. TitanDB or Neo4J), document databases (e.g. MongoDB, CouchDB), column databases (e.g. HBase), key-value stores (e.g. Redis) or search technologies (e.g. Solr). For certain use cases, NoSQL databases provide more adequate and better performing data structures, such as graphs/trees, hash maps or inverted indexes.

In addition, I proposed a long-term storage layer to have an even cheaper storage for data that is rarely accessed, but may be accessed eventually. All layers are supported by a distributed file system, such as HDFS, to store and retrieve data. A core concept is that computation is brought to the data (cf. here). On the analysis side, usually standard machine learning algorithms, but also online machine learning algorithms, are used.

As you can see, the Lambda-Architecture can be realized using many different software components and combinations thereof.
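For illustration, here is a minimal sketch of a speed layer, assuming PySpark Structured Streaming with the Kafka connector on the classpath; the broker address, topic name and window size are illustrative assumptions, not a reference implementation:

```python
# Minimal sketch of a Lambda-style speed layer: 1-minute aggregations over a Kafka stream.
# Assumes the spark-sql-kafka connector package is available to Spark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("speed-layer-sketch").getOrCreate()

# Read the stream of click events from Kafka (broker and topic are assumed names).
clicks = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "clicks")
          .load())

# Count clicks per URL within 1-minute windows on the current data.
counts = (clicks
          .selectExpr("CAST(value AS STRING) AS url", "timestamp")
          .groupBy(F.window("timestamp", "1 minute"), "url")
          .count())

# Write the incremental results to the serving layer (here simply the console).
query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()
```

The batch layer would run analogous aggregations as regular Spark jobs over the complete history in HDFS, and the serving layer merges both result sets for interactive queries.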

While the Lambda Architecture is a viable approach to tackle Big Data challenges, other architectures have emerged that focus only on certain aspects, such as data stream processing, or on integrating it with cloud concepts.

Kappa Architecture

The Kappa Architecture focuses solely on data stream processing, i.e. “real-time” processing of “live” discrete events. Examples are events emitted by devices from the Internet of Things (IoT), social networks, log files or transaction processing systems. The original motivation was that the Lambda Architecture is too complex if you only need to do event processing.

The following assumptions exist for this architecture:

  • You have a distributed, ordered event log persisted to a distributed file system, from which stream processing platforms can pick up the events

  • Stream processing platforms can (re-)request events from the event log at any position. This is needed in case of failures or upgrades to the stream processing platform.

  • The event log is potentially large (several terabytes of data per hour)

  • Mostly online machine learning algorithms are applied, due to the constant arrival of new data, which is more relevant than the old, already processed data (see the sketch right after this list)
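As a minimal sketch of the last assumption, the following shows an online learner that is updated incrementally as batches of new events arrive, assuming scikit-learn; the features, labels and batch handling are illustrative:

```python
# Minimal sketch: incrementally updating a model on a stream of events (assumes scikit-learn).
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = np.array([0, 1])  # all classes must be declared on the first partial_fit call

def on_new_batch(features: np.ndarray, labels: np.ndarray) -> None:
    # Update the model with the newest events only; no retraining over the full history.
    model.partial_fit(features, labels, classes=classes)
```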

Technically, the Kappa Architecture can be realized using Apache Kafka for managing the data streams, i.e. providing the distributed, ordered event log. Apache Samza enables Kafka to store the event log on HDFS for fault tolerance and scalability. Examples of stream processing platforms are Apache Flink, Apache Spark Streaming or Apache Storm. The serving layer can in principle use the same technologies as described for the serving layer of the Lambda Architecture.
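As a minimal sketch of re-requesting events from an arbitrary position in the event log (the second assumption above), the following uses the confluent-kafka Python client; the broker address, topic, partition and offset are illustrative assumptions:

```python
# Minimal sketch: a consumer that (re-)reads the ordered event log from a chosen offset.
from confluent_kafka import Consumer, TopicPartition

def process(payload: bytes) -> None:
    """Hypothetical processing step of the stream processing platform."""
    print(payload)

consumer = Consumer({
    "bootstrap.servers": "broker:9092",   # assumed broker address
    "group.id": "stream-processor",
    "enable.auto.commit": False,          # commit only after successful processing
})

# After a failure or an upgrade, re-request events from a known position in the log.
consumer.assign([TopicPartition("sensor-events", 0, 42)])  # topic, partition, offset (assumed)

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        process(msg.value())
        consumer.commit(msg)  # mark the event as processed
finally:
    consumer.close()
```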

There are some pitfalls with the Kappa architecture that you need to be aware of:

  • End-to-end ordering of events: While technologies such as Kafka can provide the events in an ordered fashion, this relies on the source system actually delivering the events in order. For instance, I had the case that a system in normal operation sent its events in order, but in case of communication errors it did not: it stored the events it could not send and retransmitted them at a later point, while in the meantime, once communication was re-established, it was already sending newer events. The source system had to be adapted to handle these situations correctly. Alternatively, you can only ensure a partial ordering using vector clocks or similar mechanisms implemented at the event log or stream processing level.

  • Delivery paradigms describing how events are delivered to (or fetched by) the stream processing platform:

    • At least once: Each event is guaranteed to be delivered, but the same event might be delivered twice or more due to processing errors or communication/operation errors within Kafka. For instance, the stream processing platform might crash before it can mark events as processed although it has already processed them. This might have undesired side effects, e.g. the same event that “user A liked website W” is counted several times.

    • At most once: The event will be delivered at most once (this is the default Kafka behavior). However, it might also get lost and not be delivered at all. This could have undesired side effects, e.g. the event “user A liked website W” is not taken into account.

    • Once and only once: The event is guaranteed to be delivered exactly once. This means it will neither get lost nor be delivered twice or more times. However, this is not simply a combination of the above scenarios. Technically, you need to make sure in a multi-threaded, distributed environment that an event is processed exactly once. This means that the same event needs to (1) be processed by only one sequential process in the stream processing platform and (2) all other processes related to the event need to be made aware that one of them is already processing it. Both features can be implemented using distributed-system techniques, such as semaphores or monitors. They can be realized using distributed cache systems, such as Ignite or Redis, or to a limited extent ZooKeeper (see the sketch after this list). Another simple possibility would be a relational database, but this would quickly stop scaling with large volumes.
      • Needless to say: the source system must also make sure that it delivers the same event once and only once to the ordered event log.

  • Online machine learning algorithms constantly change the underlying model to adapt it to new data. This model is used by other applications to make predictions (e.g. predicting when a machine has to go into maintenance). This also means that in case of failure we may temporarily have an outdated or simply wrong model (e.g. in case of at-least-once or at-most-once delivery). Hence, the applications need to incorporate some business logic to handle this (e.g. do not register a machine twice for maintenance, or avoid permanently registering/unregistering a machine for maintenance).
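The sketch referred to above shows one possible way to approximate “once and only once” processing on the consumer side, assuming a Redis instance as a shared marker store; the key naming, the TTL and the business function are illustrative assumptions:

```python
# Minimal sketch: deduplicating events across workers with Redis as a shared marker store.
import redis

r = redis.Redis(host="redis", port=6379)  # assumed Redis instance shared by all workers

def apply_business_logic(payload: dict) -> None:
    """Hypothetical effect of the event, e.g. incrementing a like counter."""
    print(payload)

def handle_event(event_id: str, payload: dict) -> None:
    # SET with NX acts as a distributed marker: only the first worker that sets the key
    # for this event id proceeds; all later duplicates are silently skipped.
    first_time = r.set(f"processed:{event_id}", 1, nx=True, ex=86400)
    if not first_time:
        return
    apply_business_logic(payload)
```

Note that this only deduplicates processing; if the business logic fails after the marker has been set, additional compensation logic is still required.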

Although technologies such as Kafka can help you with this, it requires a lot of thinking as well as experienced developers and architects to implement such a solution. The batch processing layer of the Lambda Architecture can somewhat mitigate the aforementioned pitfalls, but it can also be affected by them.

Last but not least, although the Kappa Architecture seems intriguing due to its simplification compared to the Lambda Architecture, not everything can be conceptualized as events. For example, company balance sheets, end-of-month reports, quarterly publications etc. should not be forced into an event representation.

Microservice Architecture for Big Data

The Microservice Architecture did not originate in the Big Data movement, but is slowly being picked up by it. It is not a precisely defined style, but several common aspects exist. Basically, it is driven by the following technology advancements:

  • The implementation of applications as services instead of big monoliths
  • The emergence of software containers to deploy each of those services in isolation from each other. Isolation means that they are put in virtual environments sharing the same operating system (i.e. they are NOT in different virtual machines), and that they are connected to each other via virtualized networks and virtualized storage. These containers leverage the available resources much better than virtual machines.
    • Additionally, the definition of repositories for software containers, such as the Docker registry, to quickly version, deploy and upgrade dependent containers and to test upgraded containers.
  • The deployment of container operating systems and orchestrators, such as CoreOS, Kubernetes or Apache Mesos, to efficiently manage software containers, manage their resources, schedule them to physical hosts and dynamically scale applications according to needs.
  • The development of object stores, such as OpenStack Swift, Amazon S3 or Google Cloud Storage. These object stores are needed to store data beyond the lifecycle of a software container in a highly dynamic cloud or scaling on-premise environment.
  • The DevOps paradigm – especially the implementation of continuous integration and delivery processes with automated testing and static code analysis to improve software quality. This also includes quick deliveries of individual services into production at any time, independently of each other.

An example of the Microservice architecture is the AWS Lambda platform (not to be confused with the Lambda Architecture) and related services provided by Amazon AWS.
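For illustration, here is a minimal sketch of a single business function as it could be deployed on AWS Lambda; the handler signature follows the AWS Lambda Python runtime, while the event structure and the enrichment logic are illustrative assumptions:

```python
# Minimal sketch of a business function for AWS Lambda (Python runtime).
import json

def handler(event, context):
    # A single, isolated business function: enrich an incoming record and return it.
    record = event.get("record", {})
    record["processed"] = True
    return {
        "statusCode": 200,
        "body": json.dumps(record),
    }
```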

Nevertheless, the Microservice Architecture poses some challenges for Big Data architectures:

  • What should be a service: For instance, you have Apache Spark or Apache Flink, which form a cluster to run your application. Should you have a dedicated cluster of software containers for each application, or should you provide one shared cluster of software containers? It can make sense to go with the first solution, i.e. a dedicated cluster per application, due to the different scaling and performance needs of each application.
  • The usage of object stores: Object stores are needed as a large-scale, dynamically scalable storage that is shared among containers. However, there are currently some issues, such as performance and consistency models (“eventually consistent”). Here, the paradigm of “Bring Computation to Data” (cf. here) is violated. Nevertheless, this can be mitigated either by using HDFS as a temporary file system on the containers and fetching the data beforehand from the object store (as illustrated in the sketch below), or by using an in-memory caching solution, such as provided by Apache Ignite or, to some extent, Apache Spark or Apache Flink.
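As a minimal sketch of the mitigation just mentioned, the following stages data from the object store onto container-local storage before processing it, assuming the boto3 client for S3; the bucket, key and local path are illustrative assumptions:

```python
# Minimal sketch: stage data from the object store locally before processing it.
import boto3

s3 = boto3.client("s3")

# Download the input once onto the container's local (or HDFS-backed) storage ...
s3.download_file("analytics-bucket", "events/2017/01/events.parquet", "/tmp/events.parquet")

# ... and let the processing engine (e.g. Spark reading /tmp/events.parquet) work on the
# local copy instead of repeatedly reading from the remote object store.
```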

I expect that in these environments the role of software-defined networking (SDN) will become crucial, not only in cloud data centers but also in on-premise data centers. SDN (which should NOT be confused with virtualized networks) enables centrally controlled, intelligent routing of network flows, as needed by the dynamically scaling platforms required by the Microservice architecture. The old decentralized definition of the network, e.g. in form of decentralized routing, simply does not scale here to enable optimal performance.

Conclusion

I presented here several architectures for Big Data that have emerged recently. Although they are based on technologies that are already several years old, I observe that many organizations are overwhelmed by these new technologies and have issues adapting and fully leveraging them. This has several reasons.

One tool to manage this could be proper Enterprise Architecture Management. While there are many benefits of Enterprise Architecture Management, I want to highlight the benefit of managed evolution. This paradigm enables aligning business and IT, although both change constantly and independently (and dependently) of each other, with not necessarily aligned goals, as illustrated in the following picture.

enterprise-architecture-managed-evolution

As you can see from the picture, both are constantly diverging, and Enterprise Architecture Management is needed to unite them again.

However, reaching a managed evolution of Enterprise Architecture usually requires many years as well as commitment from both business and IT. Enterprise Architecture for Big Data is a relatively new concept, which is still subject to change. Nevertheless, some common concepts can be identified. Some people also refer to Enterprise Architecture for Big Data as the Zeta Architecture, and it does not only encompass Big Data processing; in the context of the Microservice architecture it also covers web servers providing the user interface for analytics (e.g. Apache Zeppelin) and further management workflows, such as backup or configuration, deployed in form of containers.

This enterprise architecture for Big Data describes some integrated patterns for Big Data and Microservices, so that you can consistently document and implement your Lambda, Kappa, or Microservice architecture, or a mixture of them. Examples of artefacts of such an enterprise architecture are (cf. also here):

  • Global Resource Management to manage the physical and virtualized resources as well as scaling them (e.g. Apache Mesos and Software Defined Networking)

  • Container Management to deploy and isolate containers (e.g. Apache Mesos)

  • Execution engines to manage different processing engines, such as Hadoop MapReduce, Apache Spark, Apache Flink or Oozie

  • Storage Management to provide Object Storage (e.g. OpenStack Swift), Cache Storage (e.g. Ignite HDFS Cache), Distributed Filesystem (e.g. HDFS) and Distributed Ordered Event Log (e.g. Kafka)

  • Solution architecture for one or more services that address one or more business problems. It should be separated from the enterprise architecture, which focuses more on the strategic, global picture. It can articulate a Lambda, Kappa, or Microservice architecture, or a mixture of them.

  • Enterprise applications describe a set of services (including user interfaces)/containers to solve a business problem, including appropriate patterns for Polyglot Persistence to provide the right data structure, such as graph, columnar or hash map, for enabling interactive and fast analytics for the users based on SQL and NoSQL databases (see above)

  • Continuous Delivery that describes how Enterprise applications are delivered to production while ensuring their quality (e.g. Jenkins, SonarQube, Gradle, Puppet etc.)