Big Data has matured over the last few years and is increasingly becoming a standard technology in various industries. Starting from established concepts such as OLTP and OLAP in the context of Big Data, this blog post goes beyond them and describes what is needed for next-generation applications, such as autonomous cars, Industry 4.0 and smart cities. I will cover three new aspects: (1) making the underlying technology of predictive analytics transparent to the data scientist, (2) avoiding Big Data processing of an entire large-scale dataset by employing sampling and probabilistic data structures, and (3) ensuring the quality and consistency of predictive analytics using probabilistic databases. Finally, I will discuss how these aspects change the Big Data Lambda architecture and briefly address some technologies covering the three new aspects.
Big Data has emerged over the last few years as a concept for handling data that requires new data modeling concepts, data structures, algorithms and/or large-scale distributed clusters. There are several reasons for this: large data volumes, new analysis models, and changing requirements in the light of new use cases, such as Industry 4.0 and smart cities.
During investigations of these new use cases it quickly became apparent that existing technologies, such as relational databases, would not be sufficient to address the new requirements. This was due to data structures and algorithms that are inefficient for certain analytics questions, but also to the inherent limits of scaling them.
Hence, Big Data technologies have been developed and are subject to continuous improvement for old and new use cases.
Big Online Transaction Processing (OLTP)
OLTP has been around for a long time and focuses on transaction processing. When the concept of OLTP emerged, it was usually treated as a synonym for simply using relational databases to store various information related to an application; most people forgot that it was about processing transactions. Additionally, it was not about technical database transactions, but about business transactions, such as ordering products or receiving money. Nevertheless, most relational databases secure business transactions via technical transactions by adhering to the ACID criteria.
Today OLTP is relevant given its numerous implementations in enterprise systems, such as Enterprise Resource Planning systems, Customer Relationship Management systems or Supply Chain Management systems. Due to the growing complexity of international organisations, these systems hold and generate more and more data. For instance, a large online vendor can have several exabytes of transaction data. Hence, Big Data also happens for OLTP, particularly if this data needs to be historized for analytical purposes (see next section).
However, one important difference from other systems is the access pattern: usually there are a lot of concurrent users, each interested in a small part of the data. For instance, a customer relations agent adds some details about a conversation with a customer. Another example is that an order is updated. Hence, you need to be able to find and update a small data set within a much larger data set. Different mechanisms to handle a lot of data for OLTP usage have existed in relational database systems for a long time.
Big Online Analytical Processing (OLAP)
OLAP has been around nearly as long as OLTP, because most analyses have been done on historized transactional data. Due to the historization and different analysis needs, the amount of data is significantly higher than in OLTP systems. However, OLAP has a different access pattern: there are fewer concurrent users, but they are interested in the whole set of data, because they want to generate aggregated statistics over it. Hence, a lot of data is usually transferred into an OLAP system from different source systems, and afterwards it is mostly read, and read very often.
This led very early to the development of special OLAP databases that store data for multidimensional analysis in cubes to match the aforementioned access pattern. They can be seen as very early NoSQL databases, although they were not described as such at that time, because the term NoSQL appeared only much later.
While data from OLTP systems was originally the primary source for OLAP systems, new sources of data have appeared, such as sensor data or social network graphs. This data goes beyond the capability of OLTP or special OLAP databases and requires new approaches.
Going beyond OLTP and OLAP
Aspect 1: Predictive Analytics
Data scientists employing predictive analytics use statistical and machine learning techniques to predict how a situation may evolve in the future. For example, they predict how sales will evolve given existing sales and patterns. Some of these techniques have existed for decades, but only recently have they become more useful, because more data can be processed with Big Data technologies.
However, current Big Data technologies, such as Hadoop, are not transparent to the end user. This is not really an issue with the Big Data technologies themselves, but with the tools used for accessing and processing the data, such as R, Matlab or SAS.
These tools require the end user to think about writing distributed analysis algorithms, e.g. via map/reduce programs in R or other languages, to do their analysis. The standard library functions for statistics can be included in such distributed programs, but the user still has to think about how to design the distributed program. This is undesirable, because these users usually do not have the skills to design such programs optimally. Hence, frustration with respect to performance and effort is likely to occur.
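To illustrate the kind of design thinking this demands, here is a minimal sketch in plain Python (not R, and not tied to Hadoop or any real cluster API) of a mean computed in map/reduce style. The function names and the three example partitions are hypothetical. The point is that even for a simple average the user must choose the right intermediate result, here a (sum, count) pair per partition; averaging the per-partition averages gives a different, wrong answer when partitions differ in size.

```python
from functools import reduce


def map_partition(values):
    """Map step: summarise one partition as a (sum, count) pair."""
    return (sum(values), len(values))


def combine(a, b):
    """Reduce step: merge two partial (sum, count) summaries."""
    return (a[0] + b[0], a[1] + b[1])


def distributed_mean(partitions):
    """Mean over all partitions, computed only from the partial summaries."""
    total, count = reduce(combine, map(map_partition, partitions), (0.0, 0))
    return total / count if count else float("nan")


# Hypothetical data split unevenly across three workers.
partitions = [[10.0, 20.0], [30.0], [40.0, 50.0, 60.0]]

correct_mean = distributed_mean(partitions)

# A tempting but wrong design: average the per-partition averages.
naive_mean = sum(sum(p) / len(p) for p in partitions) / len(partitions)

print(correct_mean, naive_mean)
```

The two printed values differ, which is exactly the sort of pitfall a statistics user should not have to reason about.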
Furthermore, organisations have to think about an integrated repository where they store these programs to enable reuse and consistent analytics. This is particularly difficult, because these programs are usually maintained by business users, who lack proper software management skills.
Unfortunately, it cannot be expected that the situation changes very soon.
Aspect 2: Sampling & Probabilistic Data Structures
Surprisingly often when we deal with Big Data, end users tend to execute queries over the whole set of data, regardless of whether it has 1 million rows or 1 billion rows.
While it is certainly possible to process a data set of nearly any size with modern Big Data technologies, one should think carefully about whether this is desirable given the increased cost, time and effort.
For instance, if I want to know the average value of all transactions, I can calculate the average over all transactions. However, I could also take a random sample of, say, 5% of the transactions and know that the average of this sample lies within a quantifiable error, for example ±1%, of the average of the total population. For most decision making this is perfectly fine. Moreover, I needed to process only a fraction of the data and can use the saved time and resources for further analysis. This may even lead to better informed decisions.
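The idea can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions, not how any particular sampling database works: the transaction values are synthetic (uniformly distributed), the function name is hypothetical, and the error margin is the usual normal-approximation confidence interval (z = 1.96 for roughly 95% confidence). The actual margin depends on the sample size and the variance of the data, so a 5% sample does not by itself guarantee ±1%.

```python
import math
import random
import statistics


def sample_mean_with_error(values, fraction, z=1.96, seed=0):
    """Estimate the mean from a simple random sample.

    Returns (estimate, margin), where margin is the half-width of an
    approximate confidence interval (normal approximation).
    """
    rng = random.Random(seed)
    n = min(len(values), max(2, int(len(values) * fraction)))
    sample = rng.sample(values, n)
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / math.sqrt(n)
    return mean, z * stderr


# Synthetic transaction values; a real system would read them from storage.
data_rng = random.Random(42)
transactions = [data_rng.uniform(1.0, 100.0) for _ in range(200_000)]

estimate, margin = sample_mean_with_error(transactions, fraction=0.05)
exact = statistics.fmean(transactions)

print(f"sampled mean: {estimate:.2f} +/- {margin:.2f}")
print(f"exact mean:   {exact:.2f}")
```

Only 10,000 of the 200,000 values are touched for the estimate, and the reported margin tells the decision maker how much to trust it.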
Luckily, there are already technologies that allow this. One example is BlinkDB, which supports, among others, the following SQL queries:
- Calculate the average of the transaction values within 2 seconds with some error: SELECT avg(transactionValue) FROM table WITHIN 2 SECONDS
- Calculate the average of the transaction values within given error boundaries: SELECT avg(transactionValue) FROM table ERROR 0.1 CONFIDENCE 95.0%
Executed over a large-scale dataset, such queries typically run much faster than on Big Data technologies that do not employ sampling methods.
This makes sense particularly for predictive algorithms, because they already carry underlying assumptions about statistical errors, which a data scientist can easily combine with the errors reported by a sampling database.
Probabilistic data structures, such as Bloom filters or HyperLogLog, aim in the same direction. They are increasingly implemented in traditional SQL databases and NoSQL databases.
Bloom filters can tell you whether an element is part of a set without browsing through the set, by employing a smart hash structure. This means you can skip trying to access elements on disk that do not exist anyway. For instance, if you want to join two large datasets, you only need to load the data for which there is a corresponding value in the other dataset. This dramatically improves performance, because you need to load less data from slow storage.
However, Bloom filters can only tell you with certainty that an element is not in the set. If the filter reports that an element is in the set, this is only true with a certain probability, i.e. false positives are possible but false negatives are not. For the given use cases of Bloom filters this is no problem, because a false positive merely causes an unnecessary lookup.
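A minimal Bloom filter sketch in Python shows both properties. It is an illustration only, not the implementation used by any particular database: the class and method names are hypothetical, the bit array is sized with the standard formulas for a target false-positive rate, and the hash positions are derived from one SHA-256 digest via double hashing. The example then uses the filter to pre-filter one side of a join, with made-up customer and order ids.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter; names and defaults are assumptions."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.size = max(1, math.ceil(
            -expected_items * math.log(false_positive_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two 64-bit hash values.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely not in the set"; True means "probably in the set".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Join pre-filter: keep only orders whose customer id may exist on the other side.
customers = BloomFilter(expected_items=1000)
for customer_id in range(1000):
    customers.add(customer_id)

orders = [(order_id, order_id % 5000) for order_id in range(20000)]
candidates = [order for order in orders if customers.might_contain(order[1])]

print(len(orders), len(candidates))
```

Most orders that cannot match are discarded without touching the customer data; the few false positives that slip through are removed by the actual join.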
HyperLogLog structures allow you to count the number of unique elements in a set without storing the whole set.
For example, let us assume you want to count the unique listeners of a song on the web.
If you use traditional SQL technologies, then for 5 million unique listeners (not uncommon on Spotify) you need to store 80 MB of data for a single song. It can also take several seconds per web site request just to do the unique count or to insert a new unique listener.
By using HyperLogLog you need to store at most a few kilobytes of information (usually much less) and can read or update the count of unique listeners almost instantaneously.
The results are approximate, but usually correct within a small, configurable error margin, such as 0.12%.
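The following Python sketch shows the principle; it is a simplified illustration, not the implementation of any specific database. The class name, the use of SHA-256 as hash function and the listener ids are assumptions. The 80 MB above corresponds to roughly 16 bytes per stored listener id (5,000,000 × 16 bytes), whereas this sketch keeps 4096 one-byte registers, i.e. 4 KB, regardless of how many listeners are added. The commonly quoted standard error of HyperLogLog is about 1.04 / sqrt(m) for m registers, so the error margin is configured by choosing m: about 1.6% for the 4096 registers used here, and smaller with more registers.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog with 2**precision one-byte registers."""

    def __init__(self, precision=12):
        self.p = precision
        self.m = 1 << precision
        self.registers = bytearray(self.m)
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")        # 64-bit hash value
        index = x >> (64 - self.p)                   # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def count(self):
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            raw = self.m * math.log(self.m / zeros)
        return round(raw)


listeners = HyperLogLog(precision=12)
for i in range(100_000):
    listeners.add(f"listener-{i}")
for i in range(100_000):
    listeners.add(f"listener-{i}")   # repeated plays do not change the estimate

estimate = listeners.count()
print(estimate, len(listeners.registers))
```

Adding a listener only updates a single register, which is why both updates and reads stay cheap no matter how popular the song becomes.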
Find some calculations in my lecture materials.
Aspect 3: Probabilistic Databases
Your company has deployed a Big Data technology and uses it productively. Data scientists generate new statistical models on a daily basis all over the world based on your enormous data reservoir. However, all these models have only a very selective view on the real world and different data scientists use different methods and assumptions to do the same analysis, i.e. they have a different statistical view on the same object in the real world.
This is a big issue: your organization has an inconsistent view on the market, because different data scientists may use different underlying data, assumptions and probabilities. This can lead to a loss of revenue, because you may define contradictory strategies. For example, data scientist A may do a regression analysis on sales of ice cream based on surveys of 1,000 people in northern France. Data scientist B may independently do a regression analysis on sales of ice cream based on 100 people in southern Germany. Hence, both may come up with different predictions for ice cream sales in Central Europe, because they have different views on the world.
This is not only a collaboration issue: current Big Data technologies do not support a consistent statistical view of the world on your data.
This is where probabilistic databases will play a key role. They provide an integrated statistical view on the world and can be queried efficiently by employing new techniques, while still supporting SQL-style queries. For example, one can query the location of a truck from a database. However, instead of just one location for the truck you can get several locations, each with an associated probability. Similarly, you may join all the trucks that have a certain probability of being close to goods at a certain warehouse.
However, these technologies still lack commercial maturity and more research is needed.
Many organizations are not ready for this next big step in Big Data, and it can be expected to take at least 5-10 years until the first of them are ready to employ such technology.
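To make the truck example concrete, here is a toy Python sketch of what such queries return. It is not a probabilistic database and does not reflect the query language of any real product: the relation, the truck and location names, the probabilities and the function names are all made up. It assumes that the candidate locations of one truck are mutually exclusive and that different trucks are independent of each other.

```python
# Hypothetical uncertain relation: (truck, location, probability).
# The probabilities of one truck's candidate locations sum to 1.
truck_locations = [
    ("truck-1", "warehouse-A", 0.7),
    ("truck-1", "highway-7", 0.3),
    ("truck-2", "warehouse-A", 0.2),
    ("truck-2", "warehouse-B", 0.8),
]


def locations_of(truck):
    """All possible locations of a truck, most probable first."""
    rows = [(loc, p) for t, loc, p in truck_locations if t == truck]
    return sorted(rows, key=lambda row: -row[1])


def trucks_at(location, min_probability):
    """Trucks that are at the location with at least the given probability."""
    return [(t, p) for t, loc, p in truck_locations
            if loc == location and p >= min_probability]


def prob_any_truck_at(location):
    """Probability that at least one truck is at the location,
    assuming the trucks are independent of each other."""
    prob_none = 1.0
    for _, loc, p in truck_locations:
        if loc == location:
            prob_none *= 1.0 - p
    return 1.0 - prob_none


print(locations_of("truck-1"))
print(trucks_at("warehouse-A", min_probability=0.5))
print(prob_any_truck_at("warehouse-A"))
```

A real probabilistic database has to answer such questions over large relations and far more complex dependencies between tuples, which is where the new query techniques mentioned above come in.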
Context of the Lambda architecture
You may wonder how this all fits into the Big Data Lambda architecture and I will briefly explain it to you here.
Aspect 1: Integration of analytics tools in the cluster
Aspect 1, the integration of analytics tools with your cluster, has not really been the focus of the Lambda architecture. In fact, it is missing, although it has significant architectural consequences, since it affects the resources used, reusability and security.
Aspect 2: Sampling databases and probabilistic data structures
Sampling databases and probabilistic data structures are most suitable for the speed and serving layers. They allow fast processing of data while being only as accurate as needed. If one is satisfied with their accuracy, which can be expected for most business cases after careful consideration, then one may not even need a batch layer anymore.
Aspect 3: Probabilistic databases
Probabilistic databases will initially be part of the serving layer, because this is the layer that data scientists interact with directly in most cases. Later, however, they will be an integral part of all layers. The deployment of quantum computing in particular, as we already see it in big Internet and high-tech companies, will drive this.
In this blog post I presented important aspects for the future of Big Data technologies, covering the near-term, medium-term and long-term future.
In the near-term future we will see a better and better integration of analytics tools into Big Data technology, enabling transparent computation of sophisticated analytics models over a Big Data cluster. In the medium-term future we will see more and more usage of sampling databases and probabilistic data structures to avoid unnecessary processing of large data to save costs and resources. In the long-term future we will see that companies will build up an integrated statistical view of the world to avoid inconsistent and duplicated analysis. This enables proper strategy definition and execution of 21st century information organizations in traditional as well as new industries.