Master Data Management and the Internet of Things

Master Data Management (MDM) has matured and grown significantly over the last few years. The main motivation for master data management is to have a complete and accurate view of the master data objects in your organization. Master data objects describe key assets, such as machines or customers, that generate value for your organization. Hence, MDM fosters processes to enhance and improve the quality of master data, so that the key assets are used properly to generate value. However, most of these processes still require manual intervention by humans. Furthermore, master data is usually not up to date because it is improved and tracked manually. In particular, the current state of master data is usually entered into the system only after hours or even days. This makes it difficult to act upon this state or to predict changes to it. Clearly, this can be a disadvantage compared to competitors who leverage real-time information when using their master data. For instance, one cannot predict that a customer might move to another city in the near future or that the planes one operates will require maintenance at an inconvenient time, delaying a scheduled flight.

The vision of the Internet of Things (IoT) is to connect things that sense and act upon their environment to the Internet, where they exchange data about their state as well as their environment. IoT enables a real-time 360° view of your key assets and their interaction with the environment. Current studies estimate that by 2020 several billion things will be connected via the Internet.

Hence, it makes sense to combine MDM and IoT to improve the business processes that act upon master data. These processes will not only benefit from an up-to-date state of master data, but can also use this data to enable predictive analytics applications, such as predictive maintenance or customer retention.

I will describe both concepts in more detail and explain how they can be integrated. Afterwards, I will discuss current challenges with respect to architectures, data models and predictive analytics applications. Finally, I will provide insights into what next-generation MDM systems look like.

What is Master Data Management?

Master Data is data about the key assets in a company. Examples are customers, machines, products, suppliers, financial assets or business partners.

One should differentiate master data from transactional data, which always refers to master data. Master data objects can exist on their own and do not necessarily need to refer to other data, i.e. they make sense without any relations. For instance, a customer can exist without other customers, although the customer usually has (social) relations to other customers. A transaction for buying a product, in contrast, cannot exist without a customer and a product.
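
The distinction can be made concrete with a small data model. The following is a minimal sketch, not taken from any particular MDM product; the class names, identifiers and the purchase date are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Customer:
    """Master data: meaningful on its own, no reference required."""
    customer_id: str
    name: str


@dataclass(frozen=True)
class Product:
    """Master data: also exists independently of any transaction."""
    product_id: str
    name: str


@dataclass(frozen=True)
class Purchase:
    """Transactional data: cannot be created without a customer and a product."""
    customer: Customer
    product: Product
    purchased_on: date


# Hypothetical example records
max_m = Customer("C-1", "Max Mustermann")
phone = Product("P-1", "Smartphone")
purchase = Purchase(max_m, phone, date(2015, 1, 15))
```

Creating a `Purchase` without both master data objects raises a `TypeError`, which mirrors the rule that a transaction always refers to master data.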

One of the key issues for MDM is the integration of various systems containing master data. Usually this data is inconsistent and incomplete for various reasons. This has a significant impact on the business processes using master data, which leads to significant costs and wasted resources.

Hence, master data management solutions provide various means to improve master data quality automatically and manually. For instance, they offer rules engines to validate data quality and workflow engines to assign tasks to data stewards who fix incorrect data. Currently, most efforts to improve master data quality are manual.
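
To illustrate how a rules engine and a steward workflow fit together, here is a minimal sketch. The rule names, the record layout and the task format are assumptions made for this example and do not reflect the API of any specific MDM solution.

```python
import re

# Hypothetical validation rules: name -> predicate over a record
RULES = {
    "name is present": lambda r: bool(r.get("name", "").strip()),
    "email looks valid": lambda r: re.fullmatch(
        r"[^@\s]+@[^@\s]+\.[^@\s]+", r.get("email", "")
    ) is not None,
}


def validate(record):
    """Return the names of all rules the record violates."""
    return [name for name, rule in RULES.items() if not rule(record)]


def steward_tasks(records):
    """Create one manual task per record that fails at least one rule."""
    tasks = []
    for record in records:
        failed = validate(record)
        if failed:
            tasks.append({"record_id": record["id"], "failed_rules": failed})
    return tasks


records = [
    {"id": 1, "name": "Max Mustermann", "email": "max@example.com"},
    {"id": 2, "name": "", "email": "martha-at-example.com"},
]
tasks = steward_tasks(records)
```

Only the second record ends up as a task for a data steward; the automatic part is the validation, while the fix itself remains manual.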

What is the Internet of Things (IoT)?

The Internet of Things is a paradigm that connects all kinds of things, such as machines, cars, smartphones, thermostats or smoke detectors, to the Internet, where they provide information about their state and their environment to other things as well as to humans.

For example, a machine can report its utilization to other machines and inform its users about alternative machines to use in case of high utilization.

The Internet of Things does not only take into account the current state of things, but also the future state of things by employing predictive analytics applications.

For instance, a car can predict based on its sensor information that the engine is likely to fail within the next seven days. It can schedule maintenance with the manufacturer so it does not fail when it is needed by the driver.

Challenges Combining MDM and IoT

The main benefits of integrating MDM and IoT are the following:

  • Automatically update master data and its state to improve value-flow of current business processes.
  • Enable prediction on master data, such as predictive maintenance of machines or predictive customer behavior, to enable new types of business processes and models.

However, currently there are some challenges integrating them.

Internet of Things and Semantic Challenges

The Internet of Things only brings value to an organization if it can use the IoT information within a proper analytics model describing the semantic relations between things and master data objects.

For instance, if the company collects only information such as sensor “A4893983” reports its location as “50.106529,8.662162” then it is of very little value to the company.

However, if it had a proper semantic description for MDM and IoT data, it could leverage this data to generate the following information: “Customer Max Mustermann is currently at Frankfurt central station and using one of our products. His friend, Martha Musterfrau, is currently near him, but having problems with one of our products”.
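
The step from a raw reading to a meaningful statement can be sketched as follows. The sensor-to-customer mapping, the list of known places, the approximate station coordinates and the 300 m radius are all assumptions introduced for this illustration; a real system would take them from the master data and a geographic reference data set.

```python
import math

# Hypothetical semantic links between IoT identifiers and master data
SENSOR_OWNER = {"A4893983": "Max Mustermann"}
# Approximate coordinates, assumed for illustration
PLACES = {"Frankfurt central station": (50.1071, 8.6638)}


def distance_m(a, b):
    """Great-circle distance in meters between two (lat, lon) pairs (haversine)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(h))


def describe(sensor_id, location, max_distance_m=300):
    """Turn a raw sensor reading into a statement about a master data object."""
    lat, lon = (float(part) for part in location.split(","))
    owner = SENSOR_OWNER.get(sensor_id, "Unknown customer")
    for name, coords in PLACES.items():
        if distance_m((lat, lon), coords) <= max_distance_m:
            return f"Customer {owner} is currently at {name}"
    return f"Customer {owner} is at an unknown location ({location})"


message = describe("A4893983", "50.106529,8.662162")
```

Without the two lookup tables the reading stays an opaque identifier and a pair of numbers, which is exactly the low-value case described above.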

These types of predictive analytics and semantic models, as well as IoT information, require new database technologies, which will be described later.

Combining Big Data and Master Data Management

Traditional master data management solutions have not been designed with “Big Data” in mind. However, combining MDM and IoT requires “Big Data”:

  • Higher data volumes due to IoT Data
  • Complex analytics queries over existing MDM data with a lot of relationships
  • Variety of information in master data objects and IoT database

This also requires new database technologies.

Providing Prediction to Business Processes

Traditional master data management solutions only support provision of master data to business processes. However, modern master data management solutions supporting IoT will have to provide predictive analytics to business processes. Examples are answers to questions, such as the following:

  • Which of my machines is likely to fail next and which ones should be sent to maintenance?
  • What product is the customer most likely to buy next and which material do I need to buy to build it?

Relational databases are suitable for descriptive statistics, but quickly reach their limit with respect to even simple prediction models. Hence, new database technologies have to be supported.
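
As a rough illustration of the first question above, the following sketch ranks machines by the estimated number of days until a sensor value crosses a failure threshold, using a simple linear trend. The vibration readings, the threshold of 5.0 and the one-reading-per-day assumption are hypothetical; real predictive maintenance models are usually far more elaborate.

```python
def slope(values):
    """Least-squares slope of values measured at x = 0, 1, 2, ..."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def days_to_threshold(values, threshold):
    """Extrapolate the linear trend from the last reading to the threshold."""
    trend = slope(values)
    if trend <= 0:
        return float("inf")  # no upward trend, no predicted failure
    return max(0.0, (threshold - values[-1]) / trend)


# Hypothetical daily vibration readings per machine
readings = {
    "M1": [1.0, 1.1, 1.2, 1.3],
    "M2": [2.0, 2.5, 3.0, 3.5],
}
FAILURE_THRESHOLD = 5.0  # assumed value

ranking = sorted(
    ((machine, days_to_threshold(values, FAILURE_THRESHOLD))
     for machine, values in readings.items()),
    key=lambda item: item[1],
)
```

In this example `M2` is ranked first because its trend reaches the threshold in about three days, while `M1` has roughly 37 days left.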

Technology Support

Current MDM solutions are mostly based on relational SQL databases together with caching solutions. This is suitable for integrating master data objects from MDM systems into today’s business processes. Unfortunately, the limitations of relational algebra make them less suitable for predictive analytics applications. They also cannot handle the large number of relations between master data objects that is required today (e.g. many different versions of master data objects, or master reference data such as social network graphs or dependency graphs). This in turn limits the opportunities for data quality enhancements and results in poorer data quality, which leads to higher costs within the business processes using master data.

Modern MDM solutions leverage graph databases to store and analyze master data objects as well as to provide them to business processes. They offer transactional guarantees similar to those of relational databases, but have different storage and index structures that are more suitable for MDM. However, they have not yet become first-class citizens in companies, which currently have to build up knowledge in this area. Nevertheless, large software vendors, such as SAP or Oracle, are starting to offer graph databases as part of their database solutions. Popular open source graph databases and graph processing solutions, such as OrientDB, Neo4j, Spark GraphX or TitanDB, have existed for several years and can cope with large amounts of data.

Furthermore, relational databases integrate IoT data only poorly, because IoT requires the ability to ingest large volumes of data and run analytics on them. Vertical scaling, a prominent paradigm for relational databases, can no longer cope with this; instead, a database cluster consisting of several communicating nodes is needed. Column stores, such as Apache Cassandra (together with an analytics framework, such as Hadoop MapReduce or Apache Spark), Hadoop/HBase (Parquet) or SAP HANA, seem to be most suitable for this scenario. They offer high read/write throughput and are thus able to cope with the high volume of IoT data. Furthermore, they can be scaled horizontally by adding new database nodes to an existing network of nodes. Finally, you can manage load by using a messaging technology such as Apache Kafka.
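
The basic idea behind horizontal scaling can be sketched as hash partitioning: each sensor reading is routed to one node of the cluster based on its sensor identifier. The node names and readings below are hypothetical, and the simple modulo scheme is a deliberate simplification; production systems typically use more elaborate schemes such as consistent hashing so that adding a node moves only a small share of the data.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members


def node_for(sensor_id, nodes=NODES):
    """Map a sensor id to a node using a stable hash of the id."""
    digest = hashlib.sha256(sensor_id.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


# Hypothetical (sensor id, value) readings
readings = [("A4893983", 21.5), ("B1000001", 19.0), ("A4893983", 21.7)]

placement = {}
for sensor_id, value in readings:
    placement.setdefault(node_for(sensor_id), []).append((sensor_id, value))
```

All readings of the same sensor land on the same node, so per-sensor analytics can run locally on that node without moving data across the cluster.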



The following figure illustrates the concept of MDM and IoT by means of an example data model. Master data objects are represented as nodes of a graph with relations to other nodes. The following master data objects can be identified: two electronic devices and two customers. The customers, Max Mustermann and Martha Musterfrau, are friends, and this is represented in the master data object graph. Furthermore, each of the customers has an ownership relation to a product (an electronic device) sold by a company.

Finally, IoT data is illustrated in the figure. This data is connected to the master data objects and provides information about their state. For example, the smartphones of the customers provide information about their location (“Central Station, Frankfurt, Germany”). The IoT data of the electronic devices provides information about their operating status. One electronic device is operating normally and the other one is broken.

[Figure: example MDM graph]

The example demonstrates only a small excerpt of what is possible with a next-generation master data and IoT management system. Some examples of queries that can be answered:

  • Who is the owner of devices in the state “Broken”?
  • Which customers can support other customers nearby with devices in state “Broken”?
  • Which customers influence their friends to buy new devices or recommend devices?
  • Which devices in Frankfurt are likely to fail within the next week and need replacement?
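
The first two queries above can be sketched on a tiny in-memory version of the example graph. The node identifiers, the relation names `OWNS` and `FRIEND_OF`, and the assignment of the broken device to Martha Musterfrau are assumptions for this illustration; a graph database would express the same traversals in its own query language.

```python
# Master data nodes enriched with IoT state (hypothetical identifiers)
customers = {
    "max": {"name": "Max Mustermann",
            "location": "Central Station, Frankfurt, Germany"},
    "martha": {"name": "Martha Musterfrau",
               "location": "Central Station, Frankfurt, Germany"},
}
devices = {
    "d1": {"state": "Operating"},
    "d2": {"state": "Broken"},
}
# Relations as (source, relation, target) triples
edges = [
    ("max", "FRIEND_OF", "martha"),
    ("max", "OWNS", "d1"),
    ("martha", "OWNS", "d2"),
]


def owners_of(state):
    """Names of customers owning a device in the given state."""
    return sorted(customers[c]["name"] for c, rel, d in edges
                  if rel == "OWNS" and devices[d]["state"] == state)


def nearby_helpers():
    """(helper, customer in need) pairs that share the same location."""
    broken = [c for c, rel, d in edges
              if rel == "OWNS" and devices[d]["state"] == "Broken"]
    working = [c for c, rel, d in edges
               if rel == "OWNS" and devices[d]["state"] == "Operating"]
    return [(customers[h]["name"], customers[b]["name"])
            for b in broken for h in working
            if h != b and customers[h]["location"] == customers[b]["location"]]


broken_owners = owners_of("Broken")   # ['Martha Musterfrau']
helpers = nearby_helpers()            # [('Max Mustermann', 'Martha Musterfrau')]
```

Both answers depend on combining master data (customers, devices, relations) with IoT state (location, device status); neither source alone would be sufficient.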

Additional information from IoT data enables superior data quality. For instance, we can properly identify customers and devices. This avoids costly maintenance of working devices or costly replacement of non-working ones.

It is obvious that such a new system enables enhanced sales to customers because more information allows more targeted advertisement and more customization. Based on prediction models one can offer completely new value-added services.


Master Data Management enters a new era: New database technologies and the Internet of Things enable superior data quality and open up new business cases, such as predictive analytics. Ultimately this leads to new business processes offering superior value.

Nevertheless, only a few MDM solutions leverage these new technologies so far, although they are already quite mature. Additionally, the Internet of Things has to become more pervasive and organizations need to pressure their suppliers and customers to engage more with it.

Big Data: Bring Computation to Data

Big Data is the topic of the coming years. Even today, large Internet companies store exabytes of data and their revenue model is based on selling products as well as services around this data. Consequently, they need to process data using advanced statistical methods, such as machine learning. Hence, they need to think about how to do this efficiently. Currently, in-memory computing in particular is hyped as a way to address this issue. However, this is only one aspect. A fundamentally more important aspect is where the data is processed in a distributed multi-node data environment.

A brief history on software architectures

In the early days of software development, many applications were single monolithic applications deployed on a single computer. This led to several problems: developers could hardly reuse the code of monolithic applications, and the approach did not scale very well since it was limited to a single computer. The first problem has been addressed by introducing different layers into the architecture. The resulting architectures are usually based on three layers (see next figure): data layer, service layer and presentation layer. The data layer handles any functionality for managing data, such as querying or storing it. The service layer implements business logic, e.g. it implements business processes. The presentation layer allows the user to interact with the implemented business processes, e.g. by entering new customer data. The layers communicate with each other using well-defined interfaces, implemented today with REST, OData, SOAP, WebSockets or HTTP/2.

[Figure: three-layer architecture]

With the emergence of the Internet, these layers had to be placed on physically different machines to provide greater scalability. However, they were never designed with this in mind. The network has only limited bandwidth and capacity. Indeed, for very large data sets it can be faster to store the data on a large drive and transport it by truck to its destination than to send it over the network.

Additionally, the scalability of data computation receives less attention during development, because in the Internet world it is often not known how many people will access an application, and this may change over time. Hence, you need to be able to scale up and down dynamically. I observe that more and more of the development effort in this area has moved to operations, which needs to implement monitoring, load balancers and other technology to scale applications. This is also the reason why DevOps is a popular and emerging paradigm for developing and operating Internet-scale web applications, such as Netflix.

Towards New Software Architectures: Bring Computation to Data

The multi-layer approach does make sense, and you could even split it into more layers (“services”), but you have to carefully evaluate the complexity and reusability of your service design. More importantly, you will have to think about new interfaces, because if components are located on different machines or in different memory instances, your application will spend a lot of time moving data between them. For instance, the application logic on the application server may request all customer transactions from the database, correlate them and then write the results back into the database. This requires a lot of data to be transferred from the database to the application server and potentially costs a lot of performance. Finally, it does not scale at all.
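
The difference between moving data to the computation and moving the computation to the data can be shown with a small sketch. It uses an in-memory SQLite database from the Python standard library as a stand-in for the data layer; the table layout and the sample transactions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("max", 10.0), ("max", 5.0), ("martha", 7.5)],
)

# Approach 1: move the data to the computation.
# Every row crosses the boundary between data layer and service layer.
rows = conn.execute("SELECT customer, amount FROM transactions").fetchall()
totals_app = {}
for customer, amount in rows:
    totals_app[customer] = totals_app.get(customer, 0.0) + amount

# Approach 2: move the computation to the data.
# Only one aggregated row per customer is transferred.
totals_db = dict(conn.execute(
    "SELECT customer, SUM(amount) FROM transactions GROUP BY customer"
))
```

Both approaches produce the same totals, but the first one transfers every transaction, while the second one transfers only the result. With three rows this is irrelevant; with billions of rows on a remote machine it dominates the runtime.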

This problem first emerged when companies introduced the first Online Analytical Processing (OLAP) engines as part of business intelligence solutions for understanding their business. Database queries proved to be too simple and would have required transferring a lot of data to the application server first. Hence, the Structured Query Language (SQL) for databases was extended to cope with these new requirements (e.g. the CUBE operator). Moreover, you can define your own custom functions (e.g. SQL stored procedures), but they have to be implemented in a very vendor-specific way. For instance, distributed databases based on Apache Hadoop support custom functions, and sometimes you can integrate other programming languages, such as Java. While stored procedures are already an improvement in terms of security (protection against SQL injection attacks), it is very difficult to use them to write the sophisticated programs that modern Big Data applications need. For instance, many applications require machine learning, statistical correlation or other statistical methods. It is difficult to write these as stored procedures and to maintain support for different vendors. Furthermore, this leads again to monolithic applications. Finally, stored procedures are not dynamic: the application cannot decide to do any new computation on the fly without reimplementing it in the database layer (e.g. implementing a new machine learning algorithm). Hence, I suggest another way to address this issue.
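
As an illustration of a vendor-specific custom function, the following sketch registers a user-defined aggregate in SQLite through Python's standard `sqlite3` module. The aggregate name `py_median`, the table and the readings are hypothetical. The registration mechanism shown here is specific to SQLite and would have to be rewritten for any other database, which is exactly the portability problem described above.

```python
import sqlite3
import statistics


class Median:
    """User-defined aggregate: collects values and returns their median."""

    def __init__(self):
        self.values = []

    def step(self, value):
        # Called once per row in the group
        if value is not None:
            self.values.append(value)

    def finalize(self):
        # Called once per group to produce the aggregate result
        return statistics.median(self.values) if self.values else None


conn = sqlite3.connect(":memory:")
conn.create_aggregate("py_median", 1, Median)
conn.execute("CREATE TABLE readings (machine TEXT, temperature REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("m1", 70.0), ("m1", 90.0), ("m1", 75.0), ("m2", 60.0)],
)

medians = dict(conn.execute(
    "SELECT machine, py_median(temperature) FROM readings GROUP BY machine"
))
```

The computation runs inside the database engine and only one value per machine is returned, but the function has to be registered in advance and is tied to this one database product.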

A Standard for Bringing Computation to Data?

As mentioned, we want to support modern Big Data applications by providing suitable language support for machine learning and statistical methods on top of any database system (e.g. MySQL, Hadoop, HBase or IBM DB2). The next figure illustrates the new approach. The communication between the presentation and service layer works as usual. However, the services do not call functions on the data layer, but send any data-intensive computation they want to perform as an R script to the data layer, which executes it and only sends back the result.
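
The interaction pattern can be sketched as follows. The sketch is written in Python rather than R purely for illustration, and the `DataLayer` class, its `execute` method and the convention that a script stores its output in a variable named `result` are all hypothetical. It also performs no sandboxing, so it must not be used with untrusted scripts.

```python
class DataLayer:
    """Hypothetical data layer that runs submitted scripts next to the data."""

    def __init__(self, transactions):
        self._transactions = transactions

    def execute(self, script):
        # The script sees the data under the name 'transactions' and is
        # expected to store its output in a variable named 'result'.
        # Illustration only: exec() provides no isolation or security.
        namespace = {"transactions": self._transactions}
        exec(script, namespace)
        return namespace["result"]


# Script composed by a service at runtime; not known to the data layer in advance
SCRIPT = """
from statistics import mean
amounts = [t["amount"] for t in transactions if t["customer"] == "max"]
result = mean(amounts)
"""

data_layer = DataLayer([
    {"customer": "max", "amount": 10.0},
    {"customer": "max", "amount": 5.0},
    {"customer": "martha", "amount": 7.5},
])
average = data_layer.execute(SCRIPT)  # only this single number leaves the data layer
```

In contrast to a stored procedure, the computation is defined by the service at runtime, so a new analysis does not require any change to the data layer.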


I observed that the programming language R for statistical computing has recently been integrated into various data environments, such as transactional databases, Apache Hadoop clusters or in-memory databases such as SAP HANA. Hence, I think R could be a suitable language for describing computation that operates on data. Additionally, R already has a lot of packages for machine learning and statistical data processing. Finally, depending on the openness of the underlying data environment, you can integrate R tightly into it, so you may not have to do extensive in-memory transfers.

The advantages of this approach are:

  • Business logic stays in the service layer and does not move to the data layer
  • You can easily add new services without modifying the data layer – so you avoid a tight coupling, which makes it easier to change the data layer or to introduce new functionality
  • You can mine R scripts generated by services to determine which computation the user is likely to do next to start executing it before the user requests it.
  • Caching and distribution of data processing can be based on a more sophisticated analysis of the R scripts using the R Profiler Rprof
  • R is already known by many business analysts or social scientists/psychologists

However, you will need to have some functionality for governing the execution of the R scripts in the data layer. This includes decisions on when to schedule computation or creating new computing/data nodes (e.g. real-time vs batch). This will require a company-wide enterprise architecture approach where you need to define which data should be real-time and which data should be batch-processed. Furthermore, you need to take into account security and separation of concerns.
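
One small piece of such governance, deciding in which order submitted scripts are executed, can be sketched with a priority queue. The class name, the two modes and the job names are hypothetical; a real governance component would also cover resource limits, security and the creation of new nodes.

```python
import heapq
import itertools

# Assumed policy: real-time scripts always run before batch scripts
PRIORITY = {"real-time": 0, "batch": 1}


class ScriptScheduler:
    """Orders submitted scripts by mode, then by submission order."""

    def __init__(self):
        self._queue = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order per mode

    def submit(self, name, mode):
        heapq.heappush(self._queue, (PRIORITY[mode], next(self._counter), name))

    def run_order(self):
        """Drain the queue and return script names in execution order."""
        order = []
        while self._queue:
            _, _, name = heapq.heappop(self._queue)
            order.append(name)
        return order


scheduler = ScriptScheduler()
scheduler.submit("nightly-report", "batch")
scheduler.submit("fraud-check", "real-time")
scheduler.submit("churn-model", "batch")
order = scheduler.run_order()
# order == ['fraud-check', 'nightly-report', 'churn-model']
```

The mapping of scripts to modes is the part that has to come from the enterprise architecture; the scheduler only enforces it.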

In this context, Apache Hadoop might be an interesting solution from the technology perspective.

What is next

The aforementioned approach is only the beginning. By using this solution, you can think about true inter-cloud deployments of your application. Finally, you can enable inter-organizational data-processing business processes.