Big Data – What is next? OLTP, OLAP, Predictive Analytics, Sampling and Probabilistic Databases

Big Data has matured over the last years and is increasingly becoming a standard technology used in various industries. Starting from established concepts, such as OLAP and OLTP, in the context of Big Data, this blog post goes beyond them and describes what is needed for next-generation applications, such as autonomous cars, Industry 4.0 and smart cities. I will cover three new aspects: (1) making the underlying technology of predictive analytics transparent to the data scientist, (2) avoiding Big Data processing of one large-scale dataset by employing sampling and probabilistic data structures and (3) ensuring quality and consistency of predictive analytics using probabilistic databases. Finally, I will talk about how these aspects change the Big Data Lambda architecture and briefly address some technologies covering the three new aspects.

Big Data

Big Data has emerged over the last years as a concept to handle data that requires new data modeling concepts, data structures, algorithms and/or large-scale distributed clusters. There are several reasons for this, such as large data volumes and new analysis models, but also changing requirements in the light of new use cases, such as Industry 4.0 and smart cities.

During investigations of these new use cases it quickly became apparent that current technologies, such as relational databases, would not be sufficient to address the new requirements. This was due to data structures and algorithms that are inefficient for certain analytics questions, but also due to the inherent limitations of scaling them.

Hence, Big Data technologies have been developed and are subject to continuous improvement for old and new use cases.

Big Online Transaction Processing (OLTP)

OLTP has been around for a long time and focuses on transaction processing. When the concept of OLTP emerged, it was usually used as a synonym for simply using relational databases to store various information related to an application – most people forgot that it was related to the processing of transactions. Additionally, it was not about technical database transactions, but business transactions, such as ordering products or receiving money. Nevertheless, most relational databases secure business transactions via technical transactions by adhering to the ACID criteria.

Today OLTP is relevant given its numerous implementations in enterprise systems, such as Enterprise Resource Planning systems, Customer Relationship Management systems or Supply Chain Management systems. Due to the growing complexity of international organisations these systems tend to hold and generate more and more data. For instance, a large online vendor can have several exabytes of transaction data. Hence, Big Data also occurs in OLTP, particularly if this data needs to be historized for analytical purposes (see next section).

However, one important difference from other systems is the access pattern: Usually, there are a lot of concurrent users who are each interested in a small part of the data. For instance, a customer relationship agent adds some details about a conversation with a customer. Another example is that an order is updated. Hence, you need to be able to find/update a small data set within a much larger data set. Different mechanisms to handle a lot of data for OLTP usage have existed in relational database systems for a long time.

Big Online Analytical Processing (OLAP)

OLAP has been around nearly as long as OLTP, because most analyses have been done on historized transactional data. Due to the historization and different analysis needs the amount of data is significantly higher than in OLTP systems. However, OLAP has a different access pattern: There are fewer concurrent users, but they are interested in the whole set of data, because they want to generate aggregated statistics over it. Hence, a lot of data is usually transferred into an OLAP system from different source systems and afterwards it is mostly only read, but very often.
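The two access patterns can be contrasted in a minimal sketch. It uses Python's built-in sqlite3 module with an in-memory database; the table `orders` and its columns and rows are made up for illustration and are not taken from any real system.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer TEXT, region TEXT, amount REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders (customer, region, amount, status) VALUES (?, ?, ?, ?)",
    [
        ("alice", "EU", 120.0, "open"),
        ("bob", "EU", 80.0, "open"),
        ("carol", "US", 200.0, "shipped"),
    ],
)

# OLTP-style access: find and update one small record via its key,
# wrapped in a technical transaction.
with conn:
    conn.execute("UPDATE orders SET status = 'shipped' WHERE id = ?", (1,))

# OLAP-style access: read the whole table to compute aggregated statistics.
revenue_by_region = dict(
    conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
)
status_of_order_1 = conn.execute(
    "SELECT status FROM orders WHERE id = 1"
).fetchone()[0]
print(revenue_by_region, status_of_order_1)
```

The update touches a single row through the primary key, while the aggregation has to scan every row. This difference is why the two workloads tend to end up in differently optimized systems.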

This led very early to the development of special OLAP databases for storing data for multidimensional analysis in cubes to match the aforementioned access pattern. They can be seen as very early NoSQL databases, although they were not described as such at that time, because the term NoSQL appeared only much later.

While data from OLTP systems was originally the primary source for OLAP systems, new sources of data have appeared, such as sensor data or social network graphs. This data goes beyond the capability of OLTP or special OLAP databases and requires new approaches.

Going beyond OLTP and OLAP

Aspect 1: Predictive Analytics

Data scientists employing predictive analytics use statistical and machine learning techniques to predict how a situation may evolve in the future. For example, they predict how sales will evolve given existing sales and patterns. Some of these techniques have existed for decades, but only recently have they become more useful, because more data can be processed with Big Data technologies.

However, current Big Data technologies, such as Hadoop, are not transparent to the end user. This is not really an issue with the Big Data technologies themselves, but with the tools used for accessing and processing the data, such as R, Matlab or SAS.

They require that the end user thinks about writing distributed analysis algorithms, e.g. via map/reduce programs in R or other languages, to do their analysis. The standard library functions for statistics can be included in such distributed programs, but the user still has to think about how to design the distributed program. This is undesirable, because these users are usually not skilled enough to design them optimally. Hence, frustration with respect to performance and effort is likely to occur.

Furthermore, organisations have to think about an integrated repository where they store these programs to enable reuse and consistent analytics. This is particularly difficult, because these programs are usually maintained by business users, who lack proper software management skills.

Unfortunately, it cannot be expected that the situation changes very soon.

Aspect 2: Sampling & Probabilistic Data Structures

Surprisingly often when we deal with Big Data, end users tend to execute queries over the whole set of data, regardless of whether it has 1 million rows or 1 billion rows.

Sampling databases

While it is certainly possible to process a data set of nearly any size with modern Big Data technologies, one should carefully consider whether this is desirable given the increased costs, time and effort needed.

For instance, if I want to know the average value of all transactions, I can calculate it over all transactions. However, I could also take a random sample of, say, 5 % of the transactions and know that the average of this sample is correct within an error of, for example, ±1 % compared to the total population (the actual error depends on the sample size and the variance of the data). For most decision making this is perfectly fine. I needed to process only a fraction of the data and can use the saved time and resources for further analysis. This may even lead to better informed decisions.
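The idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the transaction values are synthetic, the function name `sample_mean_with_margin` is made up, and the error margin uses a simple normal approximation for a 95 % confidence interval.

```python
import math
import random
import statistics


def sample_mean_with_margin(values, fraction=0.05, z=1.96, seed=42):
    """Estimate the mean from a simple random sample.

    Returns (estimate, margin), where margin is the half-width of an
    approximate 95 % confidence interval (normal approximation).
    """
    rng = random.Random(seed)
    n = max(2, int(len(values) * fraction))
    sample = rng.sample(values, n)
    estimate = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return estimate, z * std_err


# Synthetic transaction values (hypothetical data, for illustration only).
data_rng = random.Random(0)
transactions = [data_rng.uniform(1.0, 200.0) for _ in range(200_000)]

true_mean = statistics.fmean(transactions)          # full scan
estimate, margin = sample_mean_with_margin(transactions)  # 5 % sample
relative_error = abs(estimate - true_mean) / true_mean
print(f"full scan: {true_mean:.2f}, sample: {estimate:.2f} +/- {margin:.2f}")
```

Note that the margin shrinks with the square root of the sample size, so it is the absolute number of sampled rows and the spread of the values that determine the error, not the sampling fraction alone.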

Luckily, there are already technologies allowing this. For example, BlinkDB, which allows – amongst others – the following SQL queries:

  • Calculate the average of the transaction values within 2 seconds with some error: SELECT avg(transactionValue) FROM table WITHIN 2 SECONDS
  • Calculate the average of the transaction values within given error boundaries: SELECT avg(transactionValue) FROM table ERROR 0.1 CONFIDENCE 95.0%

Executed over a large-scale dataset, these queries can run much faster than in Big Data technologies that do not employ sampling methods.

This makes sense particularly for predictive algorithms, because they have underlying assumptions about statistical errors anyway, and a data scientist can easily combine these with the errors reported by sampling databases.

Bloom filters

Probabilistic data structures, such as Bloom filters or HyperLogLog, aim in the same direction. They are increasingly implemented in traditional SQL databases and NoSQL databases.

Bloom filters can tell you if an element is part of a set of elements without browsing through the set by employing a smart hash structure. This means you can skip trying to access elements on disk which do not exist anyway. For instance, if you want to join two large datasets you only need to load the data for which there is a corresponding value in the other dataset. This dramatically improves the performance, because you need to load less data from slow storage.

However, Bloom filters can only tell you with certainty that an element is not in the set. If the filter reports that an element is in the set, this is only true with a certain probability, i.e. false positives are possible. For the given use cases of Bloom filters this is no problem, because a false positive only causes an unnecessary lookup.
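A minimal Bloom filter can be sketched as follows. This is a teaching sketch and not the implementation used by any particular database: the sizes (1024 bits, 3 hash functions), the salting scheme based on SHA-256 and the customer identifiers are assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely not in the set";
        # True means "possibly in the set" (false positives can occur).
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Hypothetical customer identifiers, for illustration only.
known_customers = BloomFilter()
for customer_id in ("C-1001", "C-1002", "C-1003"):
    known_customers.add(customer_id)

if not known_customers.might_contain("C-9999"):
    print("C-9999 is definitely unknown, no need to look it up on disk")
```

In a join, one side would be loaded into such a filter, and rows from the other side would only be fetched from slow storage if `might_contain` returns True.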


HyperLogLog structures allow you to estimate the number of unique elements in a set without storing the whole set.

For example, let us assume you want to count the unique listeners of a song on the web.

If you use traditional SQL technologies then you need to store 80 MB of data for one song with 5 million unique listeners (not uncommon on Spotify). It also takes several seconds for each website request just to count the unique listeners or to insert a new one.

By using HyperLogLog you need to store at most a few kilobytes of information (usually much less) and can read/update the number of unique listeners instantaneously.

Usually these results are correct within a minor configurable error margin, such as 0.12%.
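The following is a simplified HyperLogLog sketch in Python to show the principle. It is not the variant used by any specific product; the register count (2^12), the use of SHA-1 as hash function and the listener identifiers are assumptions for illustration. Production implementations pack the registers into a few kilobytes, whereas the Python list used here is not memory-optimized.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                       # number of registers
        self.registers = [0] * self.m
        # Bias correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")         # 64-bit hash value
        index = h >> (64 - self.p)                    # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog(p=12)
for _ in range(2):                        # every listener plays the song twice
    for listener_id in range(50_000):     # hypothetical listener identifiers
        hll.add(f"listener-{listener_id}")

estimate = hll.count()
expected_std_error = 1.04 / math.sqrt(hll.m)   # typical relative error
print(f"estimated unique listeners: {estimate:.0f}")
```

The typical relative error of HyperLogLog is roughly 1.04 divided by the square root of the number of registers, so the error margin is configured by choosing the register count.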

Find some calculations in my lecture materials.

Aspect 3: Probabilistic Databases

Your company has deployed a Big Data technology and uses it productively. Data scientists generate new statistical models on a daily basis all over the world based on your enormous data reservoir. However, all these models have only a very selective view on the real world and different data scientists use different methods and assumptions to do the same analysis, i.e. they have a different statistical view on the same object in the real world.

This is a big issue: Your organization has an inconsistent view on the market, because different data scientists may use different underlying data, assumptions and probabilities. This leads to a loss of revenue, because you may define contradictory strategies. For example, data scientist A may do a regression analysis on sales of ice cream based on surveys in Northern France with a population of 1000. Data scientist B may independently do a regression analysis on sales of ice cream in Southern Germany with a population of 100. Hence, both may come up with different predictions for ice cream sales in Central Europe, because they have different views on the world.

This is not only an issue of collaboration. Current Big Data technologies do not support a consistent statistical view of the world on your data.

This is where probabilistic databases will play a key role. They provide an integrated statistical view on the world and can be queried efficiently by employing new techniques, while still supporting SQL-style queries. For example, one can query the location of a truck from a database. However, instead of just one location of one truck you can get several locations with different probabilities associated with them. Similarly, you may join all the trucks that are, with a certain probability, close to goods at a certain warehouse.
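To make the truck example more concrete, here is a small Python sketch of how such uncertain data could be represented and queried. It is not the data model of any specific probabilistic database; the relation, the truck and warehouse names, the probabilities and the independence assumption between trucks are all made up for illustration.

```python
# Hypothetical uncertain relation: each row is (truck, location, probability).
# Rows for the same truck are mutually exclusive alternatives that sum to 1.
truck_locations = [
    ("truck-1", "Warehouse A", 0.7),
    ("truck-1", "Highway A5", 0.3),
    ("truck-2", "Warehouse A", 0.2),
    ("truck-2", "Warehouse B", 0.8),
]


def locations_of(truck):
    """All possible locations of a truck, most probable first."""
    rows = [(loc, p) for t, loc, p in truck_locations if t == truck]
    return sorted(rows, key=lambda row: -row[1])


def trucks_near(location, min_probability):
    """Trucks that are at the location with at least the given probability."""
    return [
        t for t, loc, p in truck_locations
        if loc == location and p >= min_probability
    ]


def probability_any_truck_at(location):
    """Probability that at least one truck is at the location.

    Assumes the positions of different trucks are independent.
    """
    p_none = 1.0
    for _, loc, p in truck_locations:
        if loc == location:
            p_none *= 1.0 - p
    return 1.0 - p_none


print(locations_of("truck-1"))
print(trucks_near("Warehouse A", 0.5))
print(probability_any_truck_at("Warehouse A"))
```

A real probabilistic database would derive such probabilities from statistical models and propagate them through joins and aggregations, which is the hard part that the sketch leaves out.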

Current technologies are based on academic prototypes, such as MayBMS by Cornell University, BayesStore by University of California Berkeley or Trio by Stanford University.

However, these technologies still lack commercial maturity and more research is needed.

Many organizations are not ready for this next big step in Big Data, and it is expected to take at least 5-10 years until the first are ready to employ such technology.

Context of the Lambda architecture

You may wonder how this all fits into the Big Data Lambda architecture and I will briefly explain it to you here.

Aspect 1: Integration of analytics tools in the cluster

Aspect 1, the integration of analytics tools with your cluster, has not really been the focus of the Lambda Architecture. In fact, this is missing, although it has significant architectural consequences, since it affects the resources used, reusability (cf. also here) and security.

Aspect 2: Sampling databases and probabilistic data structures

Sampling databases and probabilistic data structures are most suitable for the speed and serving layers. They allow fast processing of data while only being as accurate as needed. If one is satisfied with their accuracy, which can be expected for most business cases after thoughtful consideration, then one won’t even need a batch layer anymore.

Aspect 3: Probabilistic databases

Probabilistic databases will initially be part of the serving layer, because this is the layer the data scientists directly interact with in most cases. However, later they will be an integral part of all layers. Especially the deployment of quantum computing, as we already see it in big Internet and high-tech companies, will drive this.


I presented in this blog post important aspects for the future of Big Data technologies. This covered the near-term future, medium-term future and long-term future.

In the near-term future we will see a better and better integration of analytics tools into Big Data technology, enabling transparent computation of sophisticated analytics models over a Big Data cluster. In the medium-term future we will see more and more usage of sampling databases and probabilistic data structures to avoid unnecessary processing of large data to save costs and resources. In the long-term future we will see that companies will build up an integrated statistical view of the world to avoid inconsistent and duplicated analysis. This enables proper strategy definition and execution of 21st century information organizations in traditional as well as new industries.


Master Data Management and the Internet of Things

Master Data Management (MDM) has matured and grown significantly over the last years. The main motivation for master data management is to have a complete and accurate view on master data objects in your organization. Master data objects describe key assets, such as machines or customers, generating value for your organization. Hence, MDM fosters processes to enhance and improve the quality of master data, so that the key assets are used properly to generate value. However, most of these processes still require manual intervention by humans. Furthermore, master data is usually not up-to-date, because it is improved and tracked manually. In particular, the current state of master data is usually only entered into the system after hours or even days. This makes it difficult to act upon this state or to predict changes to it. Clearly, this can be a disadvantage compared to competitors who leverage real-time information when using their master data. For instance, one cannot predict that a customer might move to another city in the near future or that the planes one operates will require maintenance at an inconvenient time, delaying an offered flight.

The vision of the Internet of Things (IoT) is to connect things that sense and act upon their environments to the Internet, so that they exchange data about their state as well as their environment. IoT enables a real-time 360° view on your key assets and their interaction with the environment. Current studies estimate that by 2020 several billion things will be connected via the Internet.

Hence, it makes sense to combine MDM and IoT to improve your business processes acting upon master data. These processes will benefit from an up-to-date state of master data, but can also use this data to enable predictive analytics applications, such as predictive maintenance or customer retention.

I will describe both concepts in more detail and how they can be integrated. Afterwards I will discuss current challenges with respect to architectures, data models and predictive analytics applications. Finally, I will provide insights on what next-generation MDM systems look like.

What is Master Data Management?

Master Data is data about the key assets in a company. Examples are customers, machines, products, suppliers, financial assets or business partners.

One should differentiate master data from transactional data, which always refers to master data. Master data objects can exist on their own and do not necessarily need to refer to other data, i.e. they make sense without any relations. For instance, a customer can exist without other customers. However, the customer usually has (social) relations to other customers. A transaction for buying a product cannot exist without a customer and a product.

One of the key issues for MDM is the integration of various systems containing master data. Usually this data is inconsistent and incomplete due to various reasons. This has significant impact on the business processes using master data, which leads to significant cost and waste of resources.

Hence, master data management solutions provide various means to improve master data quality automatically and manually. For instance, they offer rules engines to validate data quality and workflow engines to assign tasks to data stewards to fix incorrect data. Currently, most efforts to improve master data quality are manual.
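The interplay of validation rules and steward tasks can be sketched roughly as follows. This does not reflect the API of any MDM product; the rule names, the record fields and the functions `validate` and `steward_tasks` are hypothetical and only serve as illustration.

```python
import re

# Hypothetical validation rules: each rule name maps to a predicate over a record.
RULES = {
    "name is present": lambda r: bool(r.get("name", "").strip()),
    "email looks valid": lambda r: re.fullmatch(
        r"[^@\s]+@[^@\s]+\.[^@\s]+", r.get("email", "")
    ) is not None,
    "country is ISO alpha-2": lambda r: re.fullmatch(
        r"[A-Z]{2}", r.get("country", "")
    ) is not None,
}


def validate(record):
    """Return the names of all rules the record violates."""
    return [name for name, check in RULES.items() if not check(record)]


def steward_tasks(records):
    """Create one task for a data steward per record with violations."""
    tasks = []
    for record in records:
        violations = validate(record)
        if violations:
            tasks.append({"record_id": record["id"], "violations": violations})
    return tasks


# Hypothetical customer master data, for illustration only.
customers = [
    {"id": 1, "name": "Max Mustermann", "email": "max@example.com", "country": "DE"},
    {"id": 2, "name": " ", "email": "martha(at)example.com", "country": "Germany"},
]
tasks = steward_tasks(customers)
print(tasks)
```

The rules detect problems automatically, but fixing them is still left to a human, which is exactly the manual effort described above.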

What is the Internet of Things (IoT)?

The Internet of Things is a paradigm that connects all kinds of things, such as machines, cars, smartphones, thermostats or smoke detectors, to the Internet where they provide information about their state and their environment to other things as well as humans.

For example, a machine can report its utilization to other machines and inform its users about alternative machines to use in case of high utilization.

The Internet of Things does not only take into account the current state of things, but also the future state of things by employing predictive analytics applications.

For instance, a car can predict based on its sensor information that the engine is likely to fail within the next seven days. It can schedule maintenance with the manufacturer so it does not fail when it is needed by the driver.

Challenges Combining MDM and IoT

The main benefits of integrating MDM and IoT are the following:

  • Automatically update master data and its state to improve value-flow of current business processes.
  • Enable prediction on master data, such as predictive maintenance of machines or predictive customer behavior, to enable new types of business processes and models.

However, currently there are some challenges integrating them.

Internet of Things and Semantic Challenges

The Internet of Things brings only value to an organization if it can use the IoT information within a proper analytics model describing the semantic relations between things and master data objects.

For instance, if the company collects only information such as sensor “A4893983” reports its location as “50.106529,8.662162” then it is of very little value to the company.

However, if it had a proper semantic description for MDM and IoT data then it could leverage this data to generate the following information: “Customer Max Mustermann is currently at Frankfurt central station and using one of our products. His friend, Martha Musterfrau, is currently near him, but having problems with one of our products”.
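A very reduced sketch of such a semantic enrichment step is shown below. The mapping of the sensor to a customer, the table of known places and their approximate coordinates are assumptions made for illustration; a real system would take them from the master data and from reference data.

```python
import math

# Hypothetical master data and reference data, for illustration only.
SENSOR_OWNERS = {"A4893983": "Max Mustermann"}
KNOWN_PLACES = {
    "Frankfurt central station": (50.1071, 8.6638),   # approximate coordinates
    "Frankfurt airport": (50.0379, 8.5622),           # approximate coordinates
}


def distance_km(a, b):
    """Great-circle distance between two (lat, lon) points (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def describe(sensor_id, raw_location):
    """Turn a raw sensor reading into a statement about a master data object."""
    lat, lon = (float(x) for x in raw_location.split(","))
    place = min(
        KNOWN_PLACES,
        key=lambda name: distance_km((lat, lon), KNOWN_PLACES[name]),
    )
    owner = SENSOR_OWNERS.get(sensor_id, "unknown customer")
    return f"Customer {owner} is currently near {place}"


print(describe("A4893983", "50.106529,8.662162"))
```

The raw reading only becomes useful once it is linked to a master data object (the customer) and to a meaningful description of the location.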

These types of predictive analytics and semantic models as well as IoT information require new database technologies, which will be described later.

Combining Big Data and Master Data Management

Traditional master data management solutions have not been designed with “Big Data” in mind. However, combining MDM and IoT requires “Big Data”:

  • Higher data volumes due to IoT Data
  • Complex analytics queries over existing MDM data with a lot of relationships
  • Variety of information in master data objects and IoT database

This also requires new database technologies.

Providing Prediction to Business Processes

Traditional master data management solutions only support provision of master data to business processes. However, modern master data management solutions supporting IoT will have to provide predictive analytics to business processes. Examples are answers to questions, such as the following:

  • Which of my machines is likely to fail next and which ones should be sent to maintenance?
  • What product is the customer most likely to buy next and which material do I need to buy to build it?

Relational databases are suitable for descriptive statistics, but quickly reach their limit with respect to even simple prediction models. Hence, new database technologies have to be supported.

Technology Support

Current MDM solutions are based mostly on relational SQL databases together with caching solutions. This is suitable for integrating master data objects from MDM systems into today’s business processes. Unfortunately, this makes them less suitable for predictive analytics applications due to the limitations of relational algebra. They also cannot handle the large number of relations between master data objects that is required today (e.g. many different versions of master data objects or master reference data, such as social network graphs or dependency graphs). This also limits opportunities for data quality enhancements and results in poorer data quality, which in turn leads to higher costs within the business processes using master data.

Modern MDM solutions leverage graph databases to store and analyze master data objects as well as provide them to business processes. They offer similar transactional guarantees as relational databases, but have different storage and index structures more suitable for MDM. However, they have not yet become first-class citizens in companies, which currently have to build up knowledge in this area. Nevertheless, large software vendors, such as SAP or Oracle, are starting to offer graph databases as part of their database solutions. Popular open source graph databases/processing solutions, such as OrientDB, Neo4j, Spark GraphX or TitanDB, have existed for several years and can cope with large amounts of data.

Furthermore, relational databases integrate IoT data only poorly, because this requires the ability to ingest large volumes of data and do analytics on them. Vertical scaling, a prominent paradigm for relational databases, can no longer cope with this; instead, a database cluster consisting of several communicating nodes is needed. Column-stores, such as Apache Cassandra (together with an analytics framework, such as Hadoop MapReduce or Apache Spark), Hadoop/HBase (Parquet) or SAP HANA, seem to be most suitable for this scenario. They offer high read/write throughput and thus are able to cope with the high volume of IoT data. Furthermore, they can be scaled horizontally by adding new database nodes to an existing network of nodes. Finally, you can manage load by using the Apache Kafka messaging technology.

Find here my university-level lecture material on NoSQL & Big Data platforms.


The following figure illustrates the concept of MDM and IoT by means of an exemplary data model. Master data objects are represented as nodes of a graph with relations to other nodes. The following master data objects can be identified: 2 electronic devices and 2 customers. The customers, Max Mustermann and Martha Musterfrau, are friends and this is represented in the master data object graph. Furthermore, each of the customers has an ownership relation to a product (an electronic device) sold by a company.

Finally, IoT data is illustrated in the figure. This data is connected to the master data objects providing information about their state. For example, the smartphones of the customers provide information about their location (“Central Station, Frankfurt, Germany”). The IoT data of the electronic devices provide information about their operation status. One electronic device is operating normally and the other one is broken.

[Figure: example MDM graph]

The example demonstrates only a small excerpt of what is possible with a next-generation master data and IoT management system. Some examples for queries that can be answered:

  • Who is the owner of devices in the state “Broken”?
  • Which customers can support other customers nearby with devices in state “Broken”?
  • Which customers influence their friends to buy new devices or recommend devices?
  • Which devices in Frankfurt are likely to fail within the next week and need replacement?
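The first two queries can be sketched on a tiny in-memory graph in Python. This is not the query language of any graph database; the node identifiers, relation names and the assignment of the broken device to Martha Musterfrau are assumptions made for illustration.

```python
# Hypothetical in-memory graph mirroring the example: nodes carry attributes
# (including IoT state), edges are (source, relation, target) triples.
nodes = {
    "max": {"type": "customer", "name": "Max Mustermann",
            "location": "Central Station, Frankfurt, Germany"},
    "martha": {"type": "customer", "name": "Martha Musterfrau",
               "location": "Central Station, Frankfurt, Germany"},
    "device-1": {"type": "device", "state": "Normal"},
    "device-2": {"type": "device", "state": "Broken"},
}
edges = [
    ("max", "FRIEND_OF", "martha"),
    ("martha", "FRIEND_OF", "max"),
    ("max", "OWNS", "device-1"),
    ("martha", "OWNS", "device-2"),
]


def owners_of_devices_in_state(state):
    """Names of customers owning a device in the given state."""
    return [
        nodes[source]["name"]
        for source, relation, target in edges
        if relation == "OWNS" and nodes[target]["state"] == state
    ]


def nearby_helpers():
    """Pairs (helper, customer in need): the helper owns a working device and
    is at the same location as a customer whose device is broken."""
    device_state = {s: nodes[t]["state"] for s, rel, t in edges if rel == "OWNS"}
    pairs = []
    for helper, helper_state in device_state.items():
        for needy, needy_state in device_state.items():
            same_place = nodes[helper]["location"] == nodes[needy]["location"]
            if helper_state == "Normal" and needy_state == "Broken" and same_place:
                pairs.append((nodes[helper]["name"], nodes[needy]["name"]))
    return pairs


print(owners_of_devices_in_state("Broken"))
print(nearby_helpers())
```

In a graph database these would be short traversal queries over the same nodes and relations. The remaining two questions additionally require a prediction model on top of the graph.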

Additional information from IoT data enables superior data quality. For instance, we can properly identify customers and devices. This avoids costly maintenance of working devices or costly replacement of non-working ones.

It is obvious that such a new system enables enhanced sales to customers because more information allows more targeted advertisement and more customization. Based on prediction models one can offer completely new value-added services.


Master Data Management enters a new era: New database technologies and the Internet of Things enable superior data quality and open up new business cases, such as predictive analytics. Ultimately this leads to new business processes offering superior value.

Nevertheless, only few MDM solutions are leveraging these new technologies yet, although they are already quite mature. Additionally, the Internet of Things has to become more pervasive and organizations need to pressure their suppliers and customers to engage more with it.