Collaborative Data Science: About Storing, Reusing, Composing and Deploying Machine Learning Models

Why is this important?

Machine Learning has re-emerged in recent years as new Big Data platforms provide the means to train models on more data, make them more complex and combine several models for an even more intelligent predictive/prescriptive analysis. This requires storing as well as exchanging machine learning models to enable collaboration between data scientists and applications in various environments. In the following paragraphs I will present the context of storing and deploying machine learning models, describe the dimensions along which model storage and deployment frameworks can be characterized, classify existing frameworks along those dimensions and conclude with recommendations.


Machine learning models usually describe mathematical equations with special parameters, e.g.

y = a*x + b, with y as the output value, x as the input value and a and b as parameters.

The values of those parameters are usually calculated using an algorithm that takes training data as input. Based on the training data the parameters are calculated to fit the mathematical equation to the data. Then, one can provide an observation to the model and it predicts the output related to the observation. For instance, given a customer with certain attributes (age, gender etc.) it can predict if the customer will buy the product on the web page.
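
As a toy illustration of parameter fitting (a sketch in pure Python, not taken from any particular library), the parameters a and b of y = a*x + b can be computed from training data with the closed-form least-squares solution:

```python
# Fit y = a*x + b to training data with ordinary least squares (closed form).
def fit_linear(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # a = covariance(x, y) / variance(x); b then follows from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

def predict(a, b, x):
    return a * x + b

# Training data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_linear(xs, ys)
```

Once a and b are stored, any consumer that knows the equation can reproduce the predictions, which is exactly why storing and exchanging these parameters matters.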

At the same time, as machine learning models grew more complex, they were used by multiple people or even developed jointly as part of large machine learning pipelines – a phenomenon commonly known as data science.

This is a paradigm shift from earlier days, where everyone mostly worked in isolation and usually one person had a good idea of what an analysis was about.

While it is already a challenge to train and evaluate a machine learning model, there are also other difficult tasks to consider given this context:

  • Loading/Storing/Composing different models in an appropriate format for efficient usage by different people and applications

  • Reusing models created in one platform on another platform with a different technology and/or capacity considerations in terms of hardware resources

  • Exchanging models between different computing environments within one enterprise, e.g. to promote models from development to production without the need to deploy potentially risky code in production

  • Enabling other people to discuss and evaluate different models

  • Offering pre-trained models in marketplaces so enterprises can take/buy them and integrate them together with other prediction models in their learning pipelines

Ultimately, there is a need to share those models with different people and embed them in complex machine learning pipelines.

Achieving those tasks is critical for understanding how machine learning models evolve and for using the latest technologies to gain a competitive advantage.

We describe the challenges in more detail and then discuss how technologies, such as PMML or software containers, can address them, as well as where they are limited.

Why are formats for machine learning models difficult?

  • Variety of different types of models, such as discriminative and generative, that can be stored. Examples are linear regression, logistic regression, support vector machines, neural networks, hidden Markov models, regenerative processes and many more
  • An unambiguous definition of metadata related to models, such as type of model, parameters, parameter ontologies, structures, input/output ontologies, input data types, output data types, fitness/quality of the trained model and calculations/mathematical equations, needs to be taken into account

  • Some models are very large with potentially millions/billions of features. This is not only a calculation problem for prediction, but also demands answers on how such models should be stored for most efficient access.

  • Online machine learning, i.e. machine learning models that are retrained regularly, may need additional metadata definitions, such as which data has been applied to them when, what data from the past should be applied to them, if any, and how frequently they should be updated

  • Exchange of models between different programming languages and systems is needed to evolve them to the newest technology

  • Some special kinds of learning models, e.g. those based on graph models, might have a less efficient matrix representation and a more efficient one based on lists. Although there are compression algorithms for sparse matrices, they might not be as efficient for certain algorithms as lists

  • Models should be easy to version

  • Execution should be as efficient as possible
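
The representation point above can be made concrete with a pure-Python sketch contrasting a dense matrix with an adjacency-list representation for a sparse graph; the numbers are illustrative only:

```python
# A graph of n nodes with only a few edges: dense matrix vs. adjacency list.
n = 1000
edges = [(0, 1), (1, 2), (2, 0)]  # only 3 edges exist

# Dense matrix representation: n*n entries must be stored, almost all zero.
dense = [[0] * n for _ in range(n)]
for i, j in edges:
    dense[i][j] = 1
dense_entries = n * n  # 1,000,000 stored values

# Adjacency-list representation: one entry per existing edge.
adjacency = {}
for i, j in edges:
    adjacency.setdefault(i, []).append(j)
list_entries = sum(len(v) for v in adjacency.values())  # 3 stored values
```

The storage format thus determines not only file size but also which algorithms can traverse the model efficiently.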

Generic Ways on Managing Machine Learning Models

We distinguish storage approaches for machine learning models across the following dimensions:

– Low ambiguity / high ambiguity

– Low flexibility / high flexibility

Ideally, a model has low ambiguity and high flexibility. It is very clear (low ambiguity) what the model articulates, so it can be easily shared, reused, understood and integrated (possibly automatically) in complex machine learning pipelines. High ambiguity corresponds to a black-box approach: some code is implemented, but nobody knows what it does or what the underlying scientific/domain/mathematical/training assumptions and limitations are. This makes those models basically useless, because you do not know their impact on your business processes.

Furthermore, a format that can articulate all possible models of any size, both those existing now and future ones, corresponds to high flexibility.

Obviously, one may think that low ambiguity and high flexibility make the ideal storage format. However, this also introduces complexity and a much higher effort to master the format. In the end it always depends on the use case and the people as well as the applications working with the model.

In the following diagram you see how different model storage formats could be categorized across different dimensions.

[Figure: ML model storage formats categorized by ambiguity vs. flexibility]

In the following we describe in more detail what these storage formats are and how I came up with the categorization:

CSV (Comma-Separated Values) and other tabular formats (e.g. ORC, Parquet, Avro):

Most analytical tools allow storing machine learning models in CSV or other tabular formats. Although many analytical tools can process CSV files, CSV and the other tabular formats do not adhere to a standard on how columns (parameters of the model) should be named or how data types (e.g. doubles) are represented, and there is no standard on how metadata should be described. The format itself does not describe how the model can be loaded/processed or any computations to be performed. In virtually all cases the CSV format requires each tool to implement a custom ETL process to use it as a model when loading/storing it. Hence, I decided it is low flexibility, because any form of computation is defined outside the CSV or other tabular format. One advantage with respect to flexibility is that with CSV, and even more with specialized tabular formats (ORC, Parquet etc.), one can usually store very large models. In conclusion, it is categorized as high ambiguity and low flexibility.
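
To make the ambiguity concrete, here is a hedged sketch of storing and loading the linear-model parameters from the introduction in CSV; the column names `a` and `b` are my own ad-hoc convention, precisely because no standard prescribes them:

```python
import csv
import io

# Write model parameters to CSV; the column names are an ad-hoc convention
# that every consumer has to know about out-of-band.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["a", "b"])
writer.writeheader()
writer.writerow({"a": 2.0, "b": 1.0})

# Every consumer must implement its own loader that knows this convention
# and the data types, since CSV stores everything as text.
buf.seek(0)
row = next(csv.DictReader(buf))
a, b = float(row["a"]), float(row["b"])
prediction = a * 10.0 + b  # the computation itself lives outside the format
```

Nothing in the file says that `a` is a slope, that the model is linear, or even that this is a model at all, which is exactly the high-ambiguity problem.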

PMML (Predictive Model Markup Language):

PMML has existed since 1997 and is supported by many commercial and open source tools (Apache Flink, Apache Spark, KNIME, TIBCO Spotfire, SAS Enterprise Miner, SPSS Clementine, SAP HANA). PMML is based on XML (eXtensible Markup Language) and is articulated as an XML Schema. Hence, it significantly reduces ambiguity by providing a meta model for how transformations and models are described. Although this meta model is very rich, it includes only a subset of algorithms (many popular ones though) and it cannot be easily extended with new transformations or models that are then automatically understood by all tools. Furthermore, the meta model does not allow articulating on which data the model was trained or on which ontologies/concepts the input and output data are based. The possible transformations and articulated models do make it more flexible than pure tabular formats, but since it is based on XML it is not suitable for very large models containing a lot of features.

PFA (Portable Format for Analytics):

PFA is a more recent storage format compared to PMML and appeared around 2008. That also means that, contrary to PMML, it includes design considerations for "Big Data" volumes by taking Big Data platforms into account. Its main purpose is to exchange, store and deploy statistical models developed on one platform on another platform. For instance, one may write a trained model in Python and use it for predictions in a Java application running on Hadoop. Another example is that a developer trains the model in Python in the development environment and stores it in PFA to deploy it securely in production, where it is run in a security-hardened Python instance. As you see, this is already very close to the use cases described above. Additionally, it takes Big Data aspects into account by storing the model data itself in Avro format. The nice thing is that you can actually develop your code in Python/Java etc. and then let a library convert it to PFA, i.e. you do not need to know the complex and somewhat cumbersome syntax of PFA. As such it provides a lot of means to reduce ambiguity by defining a standard and a large set of conformance checks against that standard. This means that if someone develops PFA support for a specific platform/library, it can be ensured that it adheres to the standard. However, ambiguity cannot be estimated as very low, because it has no standardized means to describe input and output data as part of ontologies, or fitness/underlying training assumptions. PFA supports the definition of a wide range of existing models, but also new ones, by defining actions and control-flow/data-flow operators as well as a memory model. However, it is not as flexible as, for example, developing a new algorithm that specifically takes into account GPU features to run most efficiently. Although you can define such an algorithm in PFA, the libraries used to interpret PFA will not know how to optimize this code for GPUs or distributed GPUs given the PFA model.
Nevertheless, for the existing predefined models they can of course derive a version that runs well on GPUs. In total it has low to medium ambiguity and medium to high flexibility.
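
A PFA document is essentially JSON with a small set of standardized top-level fields (input, output, action). The following stdlib-only sketch constructs a minimal document for the linear model y = 2*x + 1; actually validating or scoring it would require a PFA engine (e.g. Titus or Hadrian), which is not shown here, and the exact expression syntax should be checked against the PFA specification:

```python
import json

# Minimal PFA-style document for y = 2*x + 1. "input"/"output" are Avro
# type names and "action" is the scoring expression applied per record.
pfa_doc = {
    "input": "double",
    "output": "double",
    "action": [{"+": [{"*": ["input", 2.0]}, 1.0]}],
}

# Because the document is plain JSON, it can be serialized and exchanged
# between platforms without any language-specific machinery.
serialized = json.dumps(pfa_doc)
restored = json.loads(serialized)
```

The key point is that the whole scoring logic travels with the document, unlike CSV, where the computation must live in each consuming tool.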

ONNX (Open Neural Network Exchange Format):

ONNX is another format for specifying storage of machine learning models. However, its main focus is neural networks. It also has an extension for "classical" machine learning models called ONNX-ML. It supports different frameworks (e.g. Caffe2, PyTorch, Apple Core ML, TensorFlow) and runtimes (e.g. Nvidia, Vespa). It is mostly Python-focused, but some frameworks, such as Caffe2, offer a C++ binding. Storage of ML models is specified in protobuf, which itself already offers wide tool support but is of course not ML-specific. It offers description of metadata related to a model, but only in the very generic sense of key-value pairs, which is not suitable for describing ontologies. It allows specifying various operators that are composed into graphs describing the data flow. Data types used as part of input and output specifications are based on protobuf data types. Contrary to PFA, ONNX does not provide a memory model. However, similarly to PFA, it does not allow full flexibility, e.g. to write code for GPUs. In total it has low to medium ambiguity and medium to high flexibility, though both ambiguity and flexibility are a little lower than for PFA.

Keras – HDF5

Keras stores a machine learning model in HDF5, which is a dedicated format for "managing extremely large and complex data collections". HDF5 itself supports many languages ranging from Python over C to Java. However, Keras is mostly a Python library. HDF5 claims to be a portable file format suitable for high performance, as it includes special time and storage space optimizations. HDF5 itself is not very well supported by Big Data platforms. However, Keras stores in HDF5 the architecture of the model, the weights of the model, the training configuration and the state of the optimizer, to allow resuming training where it was left off. This means that, contrary to simply using a tabular format as described before, it sets a standard for expressing models in a tabular format. It does not itself store training data or any metadata beyond the previously described items. As such it has medium to high ambiguity. Flexibility is between low and medium, because it can more easily describe models or the state of the optimizer.

Tensorflow format

Tensorflow has its own format for loading and storing a model, which includes variables, the graph and graph metadata. Tensorflow claims the format is language-neutral and recoverable. However, it is mostly used within the Tensorflow library. It provides only a few possibilities to express a model. As such it has medium to high ambiguity. Flexibility is higher than CSV and ranges from low to medium.

Apache Spark Internal format for storing models (pipelines)

Apache Spark offers storing a pipeline (representing a model or a combination of models) in its own serialization format that can only be used within Apache Spark. It is based on a combination of JSON describing metadata of the model/pipeline and Parquet storing the model data (weights etc.) itself. It is limited to the models available in Apache Spark and cannot easily be extended to additional models (except by extending Apache Spark). As such it ranges between medium and high ambiguity. Flexibility is between low and medium, because it requires Apache Spark to run and is limited to the models offered by Apache Spark. Clearly, one benefit is that it can store compositions of models.

Theano – Python serialization (Pickle)
Theano offers Python serialization ("Pickle"). This means nearly everything (with some restrictions) that can be expressed in Python and its runtime data structures can be stored/loaded. Python serialization, like serialization in any other programming language such as Java, is very storage/memory hungry and slow. Additionally, the Keras documentation (see above) does not recommend it. It also has serious security issues when bringing models from development to production (e.g. someone can put anything there, even things that are not related to machine learning, and can exploit security holes with confidential data in production). Furthermore, serialization between different Python versions might be incompatible.

The ambiguity is low to medium, because basically only programming language concepts can be described. Metadata, ontologies etc. cannot be expressed easily, and a lot of unnecessary Python-specific information is stored. However, given that it offers the full flexibility of Python, it ranges from medium to high flexibility.
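
The behaviour described above can be illustrated with Python's standard pickle module alone (Theano itself is not required for the sketch). Note that, as the text warns, pickle.loads must never be called on untrusted input, since unpickling can execute arbitrary code:

```python
import pickle

# An arbitrary Python object standing in for a trained model.
model = {"type": "linear", "a": 2.0, "b": 1.0}

# Serialize and deserialize. The byte stream is Python-specific and may be
# incompatible across Python versions and pickle protocol versions.
blob = pickle.dumps(model, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(blob)  # never call this on untrusted input

prediction = restored["a"] * 10.0 + restored["b"]
```

The round trip preserves arbitrary Python structures, which is the flexibility, but the blob is opaque bytes with no model-level metadata, which is the ambiguity.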

Software Container

Some data science tools allow defining a model in a so-called software container (e.g. implemented in Docker). These are packages that can be easily deployed and orchestrated. They basically allow containing any tool one wants. This clearly provides huge flexibility to data scientists, but at the cost that usually the software containers are not production-ready, as they are provided by data scientists who do not have the same skills as enterprise software developers. Usually they lack an authorization and access model or any hardening, which makes them less useful for confidential or personal data. Furthermore, if data scientists can install any tool, this leads to a large zoo of different tools and libraries which are impossible to maintain, upgrade or apply security fixes to. Usually only the data scientist who created the container knows the details of how it and the contained tools are configured, making it difficult for others to reuse it or to scale it to meet new requirements. Containers may contain data, but this is usually not recommended for data that changes (e.g. models). In these cases one needs to link permanent storage to the container. Of course, the model format itself is not predefined – any model format may be used depending on the tools in the container.

As such, containers do not provide any means to express information about the model, which means they have very high ambiguity. However, they have high flexibility.

Jupyter Notebooks

Jupyter notebooks are basically editable webpages in which the data scientist can write text that describes code (e.g. in Python) that is executable. Once executed the page will be rendered with the results from the executed code. These can be tables, but also graphs. As such, notebooks can support various programming languages or even mix different programming languages. Execution depends on data stored outside the notebook on a storage in any format that is supported by the underlying programming language.
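
Under the hood a notebook is itself a JSON file (the .ipynb format, following the nbformat schema) mixing markdown and code cells. The following stdlib-only sketch shows the rough shape of such a document:

```python
import json

# A minimal notebook in the nbformat 4 JSON structure: one markdown cell
# describing the analysis and one code cell containing the executable part.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Linear model\n", "Fit y = a*x + b to the data."]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["a, b = 2.0, 1.0\n", "print(a * 10 + b)"]},
    ],
}

serialized = json.dumps(notebook)
cell_types = [c["cell_type"] for c in json.loads(serialized)["cells"]]
```

The description lives in free-form markdown, which is easy for humans but hard for an application to interpret automatically, matching the assessment below.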

Descriptions can be arbitrarily rich, but they are written in natural language and are thus difficult for an application to process, e.g. to reuse the model in another context or to integrate it into a complex machine learning pipeline. Even for other data scientists this can be difficult if the descriptions are not adequate.

Notebooks are better understood in the scientific context, i.e. writing papers and publishing them for review, which does not address all the use cases described above.

As such it provides high flexibility and medium to high ambiguity.


I described in this blog post the importance of the storage format for machine learning models:

  • Bring machine learning models from the data scientist to a production environment in a secure and scalable manner where they are reused by applications and other data scientists

  • Sharing and using machine learning models across systems and organizational boundaries

  • Offering pretrained machine learning models to a wide range of customers

  • (Automatically) composing different models to create a new more powerful combined model

We have seen many different solutions across the dimensions flexibility and ambiguity. There is not one solution that fits all use cases, which means there is no perfect standard solution. Indeed, an organization will likely employ two or more approaches or even combine them. I see four major directions in the future:

  • Highly standardized formats, such as the portable format for analytics (PFA), that can be used across applications and thus data scientists using them

  • Flexible descriptive formats, such as notebooks, that are used among data scientists

  • A combination of flexible descriptive formats and highly standardized formats, such as using PFA in an application that is visualized in a Notebook at different stages of the machine learning pipeline

  • An extension of existing formats towards online machine learning, i.e. updatable machine learning models in streaming applications

Automated Machine Learning (AutoML) and Big Data Platforms

Although machine learning has existed for decades, the typical data scientist – as you would call the role today – would still have to go through a manual, labor-intensive process of extracting the data, cleaning, feature extraction, regularization, training, finding the right model, testing, selecting and deploying it. Furthermore, for most machine learning scenarios you do not use one model/algorithm but evaluate a plethora of different algorithms to find one suitable for the given data and use case. Nowadays a lot of data is available under the so-called Big Data paradigm, introducing the additional challenge of mastering machine learning on distributed computing platforms.

This blog post investigates how to ease the burden of manual, labor-intensive model evaluation on data scientists by presenting insights on the recent concept of automated machine learning (AutoML) and whether it can be adapted to Big Data platforms.

What is AutoML

Machine learning is about learning and making predictions from data. This can be useful in many contexts, such as autonomous robots, smart homes, agriculture or financial markets. Machine learning – despite the word "machine" in its name – is mostly a manual process that requires a highly skilled person to execute. Although the learning part is rather automated, this person needs to extract data, transform it in potentially different alternative ways, feed it into many alternative machine learning algorithms as well as use different parameters for the same algorithm, evaluate the quality of the generated prediction model and finally deploy this model for others so that they can make their own predictions without going through this labor-intensive process themselves. This is illustrated in the following figure:
[Figure: AutoML flow diagram]

Given this background, it comes as no surprise that huge marketplaces have been created where people sell and buy trained machine learning models. Examples are the Azure Machine Learning Marketplace, the Amazon AWS Artificial Intelligence marketplace, the Algorithmia Marketplace, the Caffe Model Zoo, deeplearning4j models, the MXNet model zoo, TensorFlow models and the Acumos Marketplace (based on the open source software).

Nevertheless, most organizations that want to keep up with the competition in these marketplaces, or that do not want to rely on marketplaces due to unique problems in their organizations, still have the issue of finding skilled persons to create prediction models. Furthermore, machine learning models become so complex that they are not the outcome of a single person, but of a team that needs to ensure consistent quality.

Here AutoML comes into play. It can support highly skilled machine learning persons, but also people without machine learning skills, in creating their own prediction models in a transparent manner. Ideally this is done in an automated fashion for the whole machine learning process, but contemporary technology focuses mostly on the evaluation of different models, a set of suitable parameters for these models ("hyperparameter optimization") and automated selection of the "winning" (or best) model(s) (this is highlighted in the previous figure in green). In case of several models a ranking is created on a so-called "leaderboard". This has to be done within a given time budget, i.e. one has only limited time and resources. In more advanced solutions several models could potentially be combined (ensemble learning), but this is currently in its infancy.
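
The loop described above (evaluate candidate models and hyperparameters within a time budget, then rank them on a leaderboard) can be sketched in pure Python; the candidate models and their scores here are deliberately trivial stand-ins, not a real AutoML implementation:

```python
import time

# Toy candidates: each "model" is identified by a name and a hyperparameter h.
# score() stands in for training a model and evaluating it on validation data.
def score(model_name, h):
    return {"modelA": 0.7 + 0.01 * h, "modelB": 0.9 - 0.05 * h}[model_name]

candidates = [("modelA", h) for h in (1, 2, 3)] + \
             [("modelB", h) for h in (1, 2, 3)]

budget_seconds = 5.0
deadline = time.monotonic() + budget_seconds
leaderboard = []
for name, h in candidates:
    if time.monotonic() > deadline:   # stop when the time budget is spent
        break
    leaderboard.append((score(name, h), name, h))

leaderboard.sort(reverse=True)        # best validation score first
best_score, best_name, best_h = leaderboard[0]
```

A real tool would replace score() with actual training and cross-validation, and would explore the hyperparameter space with a smarter strategy than exhaustive enumeration (e.g. Bayesian optimization or genetic programming, as in the tools listed below).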

Some observations here for this specific focus:

  • AutoML does not invent new models. It relies on an existing repository of algorithms which it iterates over and tries out with different parameters. In itself AutoML can be seen as a "meta-algorithm" over these models.
  • Similarly, it relies on an existing repository of tests that determine the suitability of a model for a given problem.
  • It requires clever sampling (size of sample, randomness) of the underlying training data, otherwise it may take very long to evaluate the models or simply the wrong data is used. A preselection of the data by a person is still needed, although the person does not require as many machine-learning-specific skills.
  • A person still needs to determine for the winning model whether its predictions make sense. For this the person does not need machine learning skills, but domain-specific skills. For instance, a financial analyst can determine whether a dataset of financial transaction attributes can predict fraud or not.

Big Data platforms

The emergence of the open source Hadoop platform in 2006 introduced Big Data platforms on commodity hardware and network clusters to the world. A few years later Hadoop was adopted for data analytics in several data analysis organizations. The main focus of Hadoop was to enable analytic tasks on large datasets in a reliable manner that was not possible before. Meanwhile, further improved platforms have been created, such as Apache Flink or Apache Spark, that focus not only on processing large data volumes, but also on processing them faster by employing various optimization techniques.

Those platforms employ several nodes that communicate over a network to execute a task in a distributed manner. This imposes some challenges for machine learning, such as:

  • Standard machine learning algorithms are not designed to be executed over a distributed network of nodes.
  • Not all machine learning algorithms can be converted into a distributed one. For instance, if you need to estimate the parameters of a model then gradient descent might require a lot of memory on a single node. Hence, other estimation methods need to be used to work in parallel on different nodes.
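
The second bullet can be illustrated with a toy single-process simulation of data-parallel gradient descent: each simulated node computes a gradient on its own data partition, and only the averaged gradient is exchanged, so no node needs the full dataset in memory. This is a sketch of the idea, not how Flink or Spark actually implement it:

```python
# Toy data-parallel gradient descent for y = a*x (one parameter),
# simulating two nodes that each hold a partition of the (x, y) data.
partitions = [
    [(1.0, 2.0), (2.0, 4.0)],   # node 1's data partition
    [(3.0, 6.0), (4.0, 8.0)],   # node 2's data partition
]

a = 0.0                          # shared parameter; the true value is 2.0
lr = 0.02
for _ in range(200):
    # Each node computes the gradient of the squared error on its partition...
    grads = []
    for part in partitions:
        g = sum(2 * (a * x - y) * x for x, y in part) / len(part)
        grads.append(g)
    # ...and only the averaged gradient crosses the (simulated) network.
    a -= lr * sum(grads) / len(grads)
```

Each round exchanges a single number per node instead of the raw data, which is the kind of restructuring that distributed estimation methods require.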

This led to specific machine learning libraries for those platforms, such as Flink ML or Spark MLlib, that support only a dedicated subset of algorithms that can be executed efficiently and effectively over a given set of nodes communicating via the network.

Tool support for AutoML

AutoML can be very useful for your organization. Amongst others the following tools exist.

For each tool, the description, supported models and supported hyperparameter optimization method are given below:

  • Auto-SKLearn: Automated machine-learning toolkit to be used in lieu of the non-automated scikit-learn. Budget can be defined by time and memory as well as search space. Automated preprocessing can be defined. Multicore processing is supported for some of the algorithms. Supported models: multiple classifiers and regressors as well as combinations (ensemble construction). Hyperparameter optimization: Bayesian optimization.

  • TPOT: Automated machine-learning toolkit offering various automated preprocessors (e.g. Standard Scaler, Principal Component Analysis). Supported models: multiple classifiers and regressors as well as combinations. Hyperparameter optimization: genetic programming.

  • Auto-WEKA: Automated machine-learning toolkit. Budget can be defined by time and memory as well as search space. Supported models: multiple classifiers and regressors as well as combinations. Hyperparameter optimization: Bayesian optimization.

  • Devol: Automated machine-learning toolkit for deep neural network architectures. Expects the data to be prepared and encoded. Sees itself more as support for experienced data scientists. Supported models: neural networks and combinations. Hyperparameter optimization: genetic programming.

  • Machine.js/Auto-ML: Automated machine-learning kit based on auto-ml. Expects the data to be prepared and encoded. Supported models: various classifiers and regressors as well as neural networks based on the Keras library; supports combinations. Hyperparameter optimization: genetic programming / grid search.

Most of these tools support only one method for hyperparameter optimization, although several methods exist. Some models do not require hyperparameter optimization, because they can derive optimal hyperparameters from the training data. Unfortunately, this is currently not integrated in any of the tools.

Those tools might not always be very end-user friendly, and you still need to deploy them across the whole organization as fat clients instead of lightweight HTML5 browser applications. As an alternative, popular cloud providers are integrating more assistants in the cloud that help you with your machine learning process.

AutoML on Big Data Platforms

The aforementioned tools have not been primarily designed for Big Data platforms. They are usually based on Python or Java, so one could use them with the Python or Java bindings of those platforms (cf. Flink or Spark). One could use the available data sources (and efficient data formats such as ORC/Parquet) and the sampling functionality of those platforms (e.g. sampling in Flink or sampling in Spark) and feed the data into the tools, which could even run on the cluster. However, they would only use one node and the rest of the cluster would not be utilized. Furthermore, the generated models are not necessarily compatible with the models provided by the Big Data platforms. Hence, one has to write a manual wrapper around those models so they can be distributed over the cluster. Nevertheless, these models too would only use one node.

This is not necessarily bad, because usually data scientists will not evaluate one dataset with AutoML but multiple datasets, so you can utilize the whole cluster by running several AutoML processes. However, it also means that data size as well as the budget for time and memory are limited to one node, which might not be sufficient for all machine learning problems. Another interesting aspect could be to run one or more winning models over a much larger dataset to evaluate them in more detail or to optimize them even more. This would again require a deeper integration of the AutoML tools and Big Data platforms.

H2O AutoML is a recent (March 2018) example of how to provide AutoML on Big Data platforms, but it currently has similar limitations as described before with respect to the Big Data platform. The only advantage currently is that the generated models are compatible with the machine learning APIs of the Big Data platforms.

Apache Spark has some methods for hyperparameter tuning, but they are limited to a pipeline or model and do not cover aspects of other AutoML solutions, such as comparing different models. Furthermore, it only evaluates a given list of parameter sets, and no time or cost budget definition is possible. This would have to be implemented by the user.

One can see that AutoML and Big Data platforms could benefit from a tighter integration in the future. It would then also be easier to leverage all the data in your data lake without extracting it and processing it locally. At the moment, although some machine learning models are supported by Big Data platforms (with some caveats related to distributed processing), not all functionality of AutoML is supported (e.g. hyperparameter tuning or genetic programming).

Google AutoML provides an automated machine learning solution in the cloud. It augments it with pre-trained models (e.g. on image data), so not everyone has to spend time again to train models.


AutoML is a promising tool to facilitate the work of less skilled as well as very skilled machine learning persons, enabling them to spend more time focusing on their work instead of on the manual, error-prone, labour-intensive machine learning process. You can on-board your organisation to machine learning even if you are in your early machine learning stages and facilitate learning of many people on this topic.

However, this does not mean that people without knowledge can use it out of the box. They should have at least a basic knowledge of how machine learning works in general. Furthermore, they should be domain experts in your organization's domain, e.g. a financial analyst in banks or an engineer in the car industry. They still need to decide which data to feed to AutoML and whether the model learned by AutoML is useful for your business purposes. For example, it does not make sense to put all the data in the world related to problematic situations for autonomous cars into AutoML and expect that it can solve all problematic situations as well as possible. Moreover, such situations are likely to be country/region-specific, so it may make more sense to have several AutoML runs for different countries/regions to develop region-specific models. Other datasets, such as blockchains/cryptoledgers, are more complex and currently require manual preprocessing and selection.

Another example is known from spurious correlations, i.e. correlations that exist but do not imply causality. In all these cases you still need a domain expert who can judge whether the model is a useful contribution for the organization.

All these things are related to the no free lunch theorem.

Even for highly-skilled machine learning persons AutoML can make sense. No-one can know all particularities of machine learning models and it simply takes a lot of time to evaluate them all manually. Some may even have their favorite models, which may mean other models are not evaluated although they may fit better. Or they simply do not have time for manual evaluation, so a preselection of candidate models can also be very useful.

One open issue is still how much time and how many resources you should let AutoML spend on a specific dataset. This is not easy to answer, and here you may still need to experiment: if the results are bad, then you probably need to spend more.

Nevertheless, AutoML as a recent field still has a lot of room for improvement, such as full integration into Big Data machine learning platforms, support of more hyperparameter tuning algorithms, and more user-friendly user interfaces, as pioneered by some cloud services.

Then, those models should be part of a continuous delivery pipeline. This requires unit testing and integration testing to avoid that the model has obvious errors (e.g. always returns 0) or that it does not fit into the environment in which it is used for prediction (e.g. the web service cannot be called or the model cannot be opened in R). Integrating machine learning models into continuous delivery pipelines for productive use has until recently not drawn much attention, because usually data scientists push models directly into the production environment, with all the drawbacks this approach may have, such as no proper unit and integration testing.