Automated Machine Learning (AutoML) and Big Data Platforms

Although machine learning exists already since decades, the typical data scientist – as you would call it today – would still have to go through a manual labor-intensive process of extracting the data, cleaning, feature extraction, regularization, training, finding the right model, testing, selecting and deploying it. Furthermore, for most machine learning scenarios you do not use one model/algorithms but evaluate a plethora of different algorithms to find one suitable for the given data and and use case. Nowadays a lot of data is available under the so-called Big Data paradigm introducing additional challenges of mastering machine learning on distributed computing platforms.

This blog post investigates on how to ease the burden on the data scientists of manual labor-intensive model evaluation by presenting insights on the recent concept of automated machine learning (AutoML) and if it can be adapted to Big data platforms.

What is AutoML

Machine learning is about learning and making prediction from data. This can be useful in many contexts, such as autonomous robots, smart homes, agriculture or financial markets. Machine Learning – despite the word „machine“ in its name – is mostly a manual process that requires a highly skilled person to execute. Although the learning part is rather automated, this person needs to extract data, transform it in potentially different alternative ways, feed it into many alternative machine learning algorithms as well as using different parameters for the same algorithm, evaluate the quality of the generated prediction model and finally deploy this model for others so that they can make their own predictions without going through this labor-intensive process themselves. This is illustrated in the following figure:
AutoML Flow Diagram(1)

Given this background, it comes at no surprise that huge market places have been created where people sell and buy trained machine learning models. For example, the Azure Machine Learning Market place, the Amazon AWS Artificial Intelligence market place, the Algorithmia Marketplace, the caffe model zoo, deeplearning4j models, mxnet model zoo, tensor flow models or the Acumos Marketplace (based on the open source software).

Nevertheless, most organizations that wants to keep up with the competition in these market places or that do not want to rely on market places due to unique problems in their organizations have still the issue to find skilled persons that create prediction models. Furthermore, machine learning models become so complex that they are not the outcome of a single person, but a team that needs to ensure consistent quality.

Here AutoML comes into play. It can support highly skilled machine learning persons, but also non-skilled machine learning persons to create their own prediction models in a transparent manner. Ideally this is done in an automated fashion for the whole machine learning process, but contemporary technology focuses mostly on the evaluation of different models, a set of suitable parameters for these models („hyperparameter optimization“) and automated selection of the „winning“ (or best) model(s) (this is highlighted in the previous figure in green). In case of several models a ranking is created on a so-called “leaderboard”. This has to be done in a given time budget, ie one has only limited time and resources. Potentially several models could be combined (aka deep learning) in more advanced solutions, but this is currently in its infancy.

Some observations here for this specific focus:

  • AutoML does not invent new models. It relies on an exisiting repository of algorithms that it iterates and tries out with different parameters. In itself AutoML can be seen as a “meta-algorithm” over these models.
  • Similarly it relies on an existing repository of tests that determine suitability of a model for a given problem.
  • It requires clever sampling (size of sample, randomness) of the underlying training data, otherwise it may take very long to evaluate the models or simply the wrong data is used. A preselection of the data by a person is still needed, although the person does not require as much machine-learning specific skills.
  • A person still needs to determine for the winning model if it makes sense what it predicts. For this the person does not need machine learning skills, but domain specific skills. For instance, a financial analyst can determine if from a dataset of financial transaction attributes can predict fraud or not.

Big Data platforms

The emergence of the open source Hadoop platform in 2006 introduced Big Data platforms on commodity hardware and networkl clusters to the world. Few years later Hadoop was adopted for data analytics in several data analysis organizations. The main focus of Hadoop was to enable analytic tasks on large dataset in a reliable manner it was not possible before. Meanwhile further improved platforms have been created, such as Apache Flink or Apache Spark, that focus not only on processing large data volumes, but processing them also faster by employing various optimization techniques.

Those platforms employ several nodes that communciate over a network to execute a task in a distributed manner. This imposes some challenges for machine learning, such as:

  • Standard machine learning algorithms are not designed to be executed over a distributed network of nodes.
  • Not all machine learning algorithms can be converted into a distributed one. For instance, if you need to estimate the parameters of a model then gradient descent might require a lot of memory on a single node. Hence, other estimation methods need to be used to work in parallel on different nodes.

This led to specific machine learning libraries, such as Flink ML or Spark Mlib, for those platforms that supported only a dedicated subset of algorithms that can be executed efficiently and effectively over a given set of nodes communicating via the network.

Tool support for AutoML

AutoML can be very useful for your organization. Amongst others the following tools exist.

Tool Description Supported Models Supported Hyperparameter optimization
Auto-SKLearn Automated machine-learning toolkit to be used in lieu of the non-automed scikit-learn. Budget can be defined by time and memory as well as search space. Automated preprocessing can be defined. Multicore processing is supported for some of the algorithms. Multiple classifiers and regressors as well as combinations (ensemble construction) Bayesian Optimization
TPOT Automated machine-learning toolkit offering various automated preprocessors (e.g. Standard Scaler, Principal Component Analysis). Multiple classifiers and regressors as well as combinations Genetic Programing
Auto-Weka Automated machine-learning toolkit. Budget can be defined by time and memory as well as search space. Multiple classifiers and regressors as well as combinations Bayesian Optimization
Devol Automated machine-learning toolkit for deep neural network architectures. Expects the data to be prepared and encoded. Sees itself more as support for experienced data scientists. Neural Networks and combinations. Genetic Programing
Machine.js/Auto-ML Automated machine learning kit based on auto-ml. Expects the data to be prepared and encoded. Various classifiers and regressor as well as Neural networks based on the Keras library. Supports combinations. Genetic Programing / Gridsearch

Most of these tools support only one method for hyperparameter optimization. However there are several methods. Some models do not require hyperparameter optimization, because they can derive optimal hyperparameter from the trained data. Unfortunately, this is currently integrated in none of the tools.

Those tools might not always be very end user friendly and you still need to deploy them in the whole organization as fat clients instead of light-weight HTML5 browser applications. As an alternative popular cloud provider integrating more assistants in the cloud that help you with your machine learning process.

AutoML on Big Data Platforms

The aforementioned tools have not been primarily designed for Big Data platforms. They usually are based on Python or Java, so one could use them with the Python or Java-bindings of those platforms (cf. Flink or Spark). One could use the available data sources (and efficient data formats such as ORC/Parquet) and sampling functionality of those platforms (e.g. sampling in Flink or sampling in Spark) and feed it into the tools that could even run on the cluster. However, they would only use one node and the rest of the cluster would not be utilized. Furthermore, the generated models are not necessarily compatible with the models provided by the Big Data platforms. Hence, one has to write a manual wrapper around those models so they can be distributed over the cluster. Nevertheless, also these models would only use one node.

This is not necessarily bad, because usually data scientists will not run one dataset to evaluate with AutoML but multiple datasets so you can utilize the whole cluster by running several AutoML processes. However, it also means data size as well as budget for time and memory is limited to one node, which might not be sufficient for all machine learning problems. Another interesting aspect could be to run one or more winning models over a much larger dataset to evaluate it in more detail or to optimize it even more. This would again require a more deep integration of the AutoML tools and Big Data platforms.

H2O AutoML is a recent (March 2018) example on how to provide AutoML on Big Data platforms, but this has currently similar limitations as described before with respect to the Big Data platform. The only advantage currently is that the models can be generated are compatible with the machine learning APIs of the Big Data platforms.

Apache Spark has some methods for hyperparameter tuning, but they are limited to a pipeline or model and do not cover aspects of other AutoML solutions, such as comparing different models. Furthermore, it only evaluates out a list of given sets of parameters and no time or cost budget definition is possible. This would have to be implemented by the user.

One can see that AutoML and Big Data platforms can benefit from a more tighter integration in the future. It would then also be more easy to leverage all the data in your data lake without extracting it and process it locally. At the moment, although some machine learning models are supported by Big Data platforms (with some caveats related to distributed processing) not all functionality of AutoML is supported (e.g. hyperparameter tuning or genetic programing).

Google AutoML provides an automated machine learning solution in the cloud. It augments it with pre-trained models (e.g. on image data), so not everyone has to spend time again to train models.

Conclusion

AutoML is a promising tool to facilitate the work of less skilled and very skilled machine learning persons to enable them to have more time focusing on their work instead of the manual error-prone labour-intensive machine learning process. You can make your organisation on-board on machine-learning even if you are in your early machine learning stages and facilitate learning of many people on this topic.

However, it does not mean that people without knowledge can use it out of the box. They should have a least a basic knowledge on how machine learning works in general. Furthermore, they should be domain experts in your organizations domain, e.g. a financial analyst in banks or an engineer in the car industry. They still need to decide which data to input to AutoML and if the model learned by AutoML is useful for your business purposes. For example, it does not make sense to put all the data of the world related to problematic situations for autonomous cars into an AutoML and expect that it can solve all problematic situation as best as possible. Moreover, it is more likely to be country/region-specific so it may make more sense to have several AutoML runs with different countries/regions to develop specific models for countries/regions. Other datasets, such as blockchains/cryptoledgers are more complex and require currently a manual preprocessing and selection.

Another example is known from spurious correlations, ie correlations that exists, but do not imply causality. In all this case you still need a domain expert that can judge if the model is a useful contribution for the organization.

All these things are related to the no free lunch theorem.

Even for highly-skilled machine learning persons AutoML can make sense. No-one can know all particularities of machine learning models and it simply takes a lot of time to evaluate them all manually. Some may even have their favorite models, which may mean other models are not evaluated although they may fit better. Or they simply do not have time for manual evaluation, so a preselection of candidate models can also be very useful.

One open issue is still how much time and resources you should let AutoML spend on a specific dataset. This is not easy to answer and here you may still need to experiment if the results are bad then you need to spend probably more.

Nevertheless, AutoML as a recent field still has a lot of room for improvements, such as full integration in Big Data machine learning platforms or support of more hyperparameter tuning algorithms as well as more user-friendly use interfaces, as pioneered by some cloud services.

Then, those models should be part of a continuous delivery pipeline. This requires unit testing and integration testing to avoid that the model has obvious errors (e.g. always returns 0) or that it does not fit into the environment in which it is used for prediction (e.g. web service cannot be called or model cannot be opened in R). Integration machine learning models into continuous delivery pipelines for productive use has not recently drawn much attention, because usually the data scientists push them directly into the production environment with all the drawbacks this approach may have, such as no proper unit and integration testing.

Advertisements

Lambda, Kappa, Microservice and Enterprise Architecture for Big Data

A few years after the emergence of the Lambda-Architecture several new architectures for Big Data have emerged. I will present and illustrate their use case scenarios. These architectures describe IT architectures, but I will describe towards the end of this blog the corresponding Enterprise Architecture artefacts, which are sometimes referred to as Zeta architecture.

Lambda Architecture

I have blogged before about the Lambda-Architecture. Basically this architecture consists of three layers:

  • Batch-Layer: This layer executes long-living batch-processes to do analyses on larger amounts of historical data. The scope is data from several hours to weeks up to years. Here, usually Hadoop MapReduce, Hive, Pig, Spark or Flink are used together with orchestration tools, such as Oozie or Falcon.

  • Speed-Layer/Stream Processing Layer: This layer executes (small/”mini”) batch-processes on data according to a time window (e.g. 1 minute) to do analyses on larger amounts of current data. The scope is data from several seconds up to several hours. Here one may use, for example, Flink, Spark or Storm.

  • Serving Layer: This layer combines the results from the batch and stream processing layer to enable fast interactive analyses by users. This layer leverages usually relational databases, but also NoSQL databases, such as Graph databases (e.g. TitanDB or Neo4J), Document databases (e.g. MongoDB, CouchDB), Column-Databases (e.g. Hbase), Key-Value Stores (e.g. Redis) or Search technologies (e.g. Solr). NoSQL databases provide for certain use cases more adequate and better performing data structures, such as graphs/trees, hash maps or inverse indexes.

In addition, I proposed the long-term storage layer to have an even cheaper storage for data that is hardly accessed, but may be accessed eventually. All layers are supported by a distributed file system, such as HDFS, to store and retrieve data. A core concept is that computation is brought to data (cf. here). On the analysis side, usually standard machine learning algorithms, but also on-line machine learning algorithms, are used.

As you can see, the Lambda-Architecture can be realized using many different software components and combinations thereof.

While the Lambda architecture is a viable approach to tackle Big Data challenges different other architectures have emerged especially to focus only on certain aspects, such as data stream processing, or on integrating it with cloud concepts.

Kappa Architecture

The Kappa Architecture focus solely on data stream processing or “real-time” processing of “live” discrete events. Examples are events emitted by devices from the Internet of Things (IoT), social networks, log files or transaction processing systems. The original motivation was that the Lambda Architecture is too complex if you only need to do event processing.

The following assumptions exists for this architecture:

  • You have a distributed ordered event log persisted to a distributed file system, where stream processing platforms can pick up the events

  • Stream processing platforms can (re-)request events from the event log at any position. This is needed in case of failures or upgrades to the stream processing platform.

  • The event log is potentially large (several Terabytes of data / hour)

  • Mostly online machine learning algorithms are applied due to the constant delivery of new data, which is more relevant than the old already processed data

Technically, the Kappa architecture can be realized using Apache Kafka for managing the data-streams, i.e. providing the distributed ordered event log. Apache Samza enables Kafka to store the event log on HDFS for fault-tolerance and scalability. Examples for stream processing platforms are Apache Flink, Apache Spark Streaming or Apache Storm. The serving layer can in principle use the same technologies as I described for the serving layer in the Lambda Architecture.

There are some pitfalls with the Kappa architecture that you need to be aware of:

  • End to end ordering of events: While technologies, such as Kafka can provide the events in an ordered fashion it relies on the source system that these events are indeed delivered in an ordered fashion. For instance, I had the case that a system in normal operations was sending the events in order, but in case of errors of communication this was not the case, because it stored the events it could not send and retransmitted them at a certain point later. Meanwhile if the communication was established again it send the new events. The source system had to be adapted to handle these situations correctly. Alternatively, you can only ensure a partial ordering using vector clocks or similar implemented at the event log or stream processing level.

  • Delivery paradigms on how the events are delivered (or fetched from) to the stream processing platform

    • At least once: The same event is guaranteed to be delivered once, but the same events might be delivered twice or more due to processing errors or communication/operation errors within Kafka. For instance, the stream processing platform might crash before it can marked events as processed although it has processed them before. This might have undesired side effects, e.g. the same event that “user A liked website W” is counted several times.

    • At most once: The event will be delivered at most once (this is the default Kafka behavior). However, it might also get lost and not be delivered. This could have undesired side effects, e.g. the event “user A liked website W” is not taken into account.

    • Once and only once: The event is guaranteed to be delivered once and only once. This means it will not get lost or delivered twice or more times. However, this is not simply a combination of the above scenarios. Technically you need to make sure in a multi-threaded distributed environment that an event is processed exactly once. This means the same event needs to be (1) only be processed by one sequential process in the stream processing platforms (2) all other processes related to the events need to be made aware that one of them already processes the event. Both features can be implemented using distributed system techniques, such as semaphores or monitors. They can be realized using distributed cache systems, such as Ignite, Redis or to a limited extent ZooKeeper. Another simple possibility would be a relational database, but this would quickly not scale with large volumes.
      • Needles to say: The source system must also make sure that it delivers the same event once and only once to the ordered event log.

  • Online machine learning algorithms constantly change the underlying model to adapt it to new data. This model is used by other applications to make predictions (e.g. predicting when a machine has to go into maintenance). This also means that in case of failure we may temporary have an outdated or simply a wrong model (e.g. in case of at least once or at most once delivery). Hence, the applications need to incorporate some business logic to handle this (e.g do not register a machine twice for maintenance or avoid permanently registering/unregistering a machine for maintenance)

Although technologies, such as Kafka can help you with this, it requires a lot of thinking and experienced developers as well as architects to implement such a solution. The batch-processing layer of the Lambda architecture can somehow mitigate the aforementioned pitfalls, but it can be also affected by them.

Last but not least, although the Kappa Architecture seems to be intriguing due to its simplification in comparison to the Lambda architecture, not everything can be conceptualized as events. For example, company balance sheets, end of the month reports, quarterly publications etc. should not be forced to be represented as events.

Microservice Architecture for Big Data

The Microservice Architecture did not originate in the Big Data movement, but is slowly picked up by it. It is not a precisely defined style, but several common aspects exist. Basically it is driven by the following technology advancements:

  • The implementation of applications as services instead of big monoliths
  • The emergence of software containers to deploy each of those services in isolation of each other. Isolation means that they are put in virtual environments sharing the same operating systems (i.e. they are NOT in different virtual machines), they are connected to each other via virtualized networks and virtualized storage. These containers leverage much better the available resources than virtual machines.
    • Additionally the definition of repositories for software containers, such as the Docker registry, to quickly version, deploy, upgrade dependent containers and test upgraded containers.
  • The deployment of container operating systems, such as CoreOS, Kubernetes or Apache Mesos, to efficiently manage software containers, manage their resources, schedule them to physical hosts and dynamically scale applications according to needs.
  • The development of object stores, such as OpenStack Swift, Amazon S3 or Google Cloud Storage. These object stores are needed to store data beyond the lifecycle of a software container in a highly dynamic cloud or scaling on-premise environment.
  • The DevOps paradigm – especially the implementation of continuous integration and delivery processes with automated testing and static code analysis to improve software quality. This also includes quick deliveries of individual services at any time independently of each other into production.

An example of the Microservice architecture is the Amazon Lambda platform (not to be confused with Lambda architecture) and related services provided by Amazon AWS.

Nevertheless, the Microservice Architecture poses some challenges for Big Data architectures:

  • What should be a service: For instance, you have Apache Spark or Apache Flink that form a cluster to run your application. Should you have for each application on them a dedicated cluster out of software container or should you provide a shared cluster of software containers. It can make sense to have the first solution, e.g. a dedicated cluster per application due to different scaling and performance needs of the application.
  • The usage of object stores. Object stores are needed as a large scale dynamically scalable storage that is shared among containers. However, currently there are some issues, such as performance and consistency models (“eventually consistent”). Here, the paradigm of “Bring Computation to Data” (cf. here) is violated. Nevertheless, this can be mitigated either by using HDFS as a temporal file system on the containers and fetching the data beforehand from the object store or use an in-memory caching solution, such as provided by Apache Ignite or to some extend Apache Spark or Apache Flink.

I see that in these environments the role of software defined networking (SDN) will become crucial not only in cloud data centers, but also on-premise data centers. SDN (which should NOT be confused with virtualized networks) enables centrally controlled intelligent routing of network flows as it is needed in dynamically scaling platforms as required by the Microservice architecture. The old decentralized definition of the network, e.g. in form of decentralized routing, does simply not scale here to enable optimal performance.

Conclusion

I presented here several architectures for Big Data that emerged recently. Although they are based on technologies that are already several years old, I observe that many organizations are overwhelmed with these new technologies and have some issues to adapt and fully leverage them. This has several reasons.

One tool to manage this could be a proper Enterprise Architecture Management. While there are many benefits of Enterprise Architecture Management, I want to highlight the benefit of managed of managed evolution. This paradigm enables to align business and IT, although there is a constant independent (and dependent) change of both with not necessarily aligned goals as illustrated in the following picture.

enterprise-architecture-managed-evolution

As you can see from the picture both are constantly diverging and Enterprise Architecture Management is needed to unite them again.

However, reaching managed evolution of Enteprise Architecture requires usually many years and business as well as IT commitment to it. Enterprise Architecture for Big Data is a relatively new concept, which is still subject to change. Nevertheless some common concepts can be identifed. Some people refer to Enterprise Architecture for Big Data also as Zeta Architecture and it does not only encompass Big Data processing, but in context of Microservice architecture also web servers providing the user interface for analytics (e.g. Apache Zeppelin) and further management workflows, such as backup or configuration, deployed in form of containers.

This enterprise architecture for Big Data describes some integrated patterns for Big Data and Microservices so that you can consistently document and implement your Lambda, Kappa, Microservice architecture or a mixture of them. Examples for artefacts of such an enterprise architecture are (cf. also here):

  • Global Resource Management to manage the physical and virtualized resources as well as scaling them (e.g. Apache Mesos and Software Defined Networking)

  • Container Management to deploy and isolate containers (e.g. Apache Mesos)

  • Execution engines to manage different processing engines, such as Hadoop MapReduce, Apache Spark , Apache Flink or Oozie

  • Storage Management to provide Object Storage (e.g. Openstack Swift), Cache Storage (e.g. Ignite HDFS Cache), Distributed Filesystem (e.g. HDFS) and Distributed Ordered Event Log (e.g. Kafka)

  • Solution architecture for one or more services that address one or more business problems. It should be separated from the enterprise architecture, which is focusing more on the strategic global picture. It can articulate a Lambda, Kappa, Microservice architecture or mixture of them.

  • Enterprise applications describe a set of services (including user interfaces)/containers to solve a business problem including appropriate patterns for Polyglot Persistence to provide the right data structure, such as graph, columnar or hash map, for enabling interactive and fast analytics for the users based on SQL and NoSQL databases (see above)

  • Continuous Delivery that describe how Enterprise applications are delivered to production ensuring the quality of them (e.g. Jenkins, Sonarqube, Gradle, Puppet etc).