Batch Processing & Interactive Analytics for Big Data – the Role of In-Memory

In this blog post I will discuss various aspects of in-memory technologies and describe how current Big Data technologies fit into this context.

In particular, I will focus on the difference between in-memory batch analytics and interactive in-memory analytics. Additionally, I will illustrate when in-memory technology is really beneficial. In-memory technology leverages fast main memory and processor caches to deliver superior performance.

While the deployment of in-memory technology initially seems attractive, companies have to carefully plan how they use the scarce resource memory for big data sets, because the amount of data tends to grow once a company successfully masters Big Data technologies. For instance, they need to think about memory fragmentation, scheduling, capacity management, how the data should be structured in-memory, and what data should be kept in-memory at all.

I will explain that some paradigms introduced for non-in-memory analytics, such as the paradigm that it is better not to read data at all than to read all of it, are still very valid for in-memory technologies.

Finally, I will give an outlook on current Big Data technologies and their strengths and weaknesses with respect to in-memory batch analytics and interactive in-memory analytics.

The Concept of In-Memory

The concept of in-memory became more and more popular around 2007/2008, although the fundamental concepts behind it have existed for decades. At that time it was marketed quite heavily by SAP and its HANA in-memory database.

Around the same time, a different paradigm appeared, the concept of distributed Big Data platforms.

In the beginning, both were rather disconnected: in-memory technologies relied on one “big” machine, while distributed data platforms consisted of a huge set of more commodity-like machines. In-memory was at this time often associated with interactive queries delivering fast responses over comparably small datasets fitting into the memory of one machine, and Big Data platforms with long-running analytics queries crunching large volumes of data scattered over several nodes.

This has changed recently: in-memory techniques have been brought to long-running analytics queries, and distributed Big Data platforms to interactive analytics. The assumed benefit in both cases is that more data can be handled in more complex ways in a shorter time.

However, you need to look carefully at what kind of business benefit you can gain from doing faster or more analytics.

Public sector organizations across various domains see significant benefits, because their “ROI” is usually measured in non-monetary terms, as benefits for society. A faster, fairer, more transparent or scientifically correct analysis can be one example of such a benefit. Additionally, supervision of the private sector needs to be on the same technological level as the private sector itself.

Traditional private sector organizations, on the other hand, will have to invent new business models and convince the customer. Here, new machine learning algorithms on large data volumes are more beneficial in comparison to traditional data warehouse reports. Internet industries, including the Internet of Things and autonomous robots, obviously benefit, be it from the processing of large data volumes and/or from the need to react quickly to events in the real world.

The Difference between in-memory batch processing and interactive analytics

Often people wonder why there is still a difference between batch processing and interactive analytics when using in-memory. In order to answer this question let us quickly recap the difference between the two:

  • Distributed big data processes: These are long-running, because they need to query data residing on several nodes and/or because the calculations are very complex and require a lot of computing power. Usually they make the calculated/processed data available in a suitable format for interactive analytics, and they are planned and scheduled in advance.
  • Interactive analytics: These are often ad-hoc queries of low to very high complexity that are expected to return results within seconds or minutes. However, they can also take much longer; they are then candidates for distributed big data processes, so that results are precomputed and stored for interactive analytics. Interactive analytics typically relies on more than plain table scans to return results faster.

The results of both can be used either by humans or by other applications, e.g. applications that require predictions in order to provide an automated service to human beings. Basically, both approaches fit into the Lambda architecture paradigm.

In-memory technologies can basically speed up both approaches. However, you need to carefully evaluate your business case. For example, it makes sense to speed up your Big Data batch processes so that they finish before your people start working, or to gain time to perform additional processes on the same resources – this is particularly interesting if you have a plethora of different large datasets for which different analytics make sense. With respect to interactive analytics, you benefit most if you have specific analytics algorithms that benefit from memory locality, e.g. iterative machine learning algorithms.
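
To make the memory-locality point concrete, here is a minimal PySpark sketch (the dataset and the “algorithm” are invented toy stand-ins): caching the input once pays off because an iterative algorithm passes over the same data many times.

```python
# Toy sketch: caching benefits iterative algorithms (illustrative data only).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Stand-in for a large table that would normally be read from disk.
points = spark.range(0, 1_000_000).selectExpr("id", "rand() AS x").cache()
points.count()  # materialize the cache once

# Each iteration scans the same data; with .cache() it is served from memory.
threshold = 0.5
for i in range(10):
    above = points.filter(points.x > threshold).count()
    threshold *= 0.9  # toy update step standing in for an ML iteration
```

Without the `.cache()`, every iteration would recompute the input lineage from scratch, typically re-reading it from disk.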

If you have people working with aggregations on large tables, you should make them aware that it makes more sense to work with samples, in-memory indexes and data structures, as well as high parallelism. Aggregating a large table entirely in-memory is very costly, and the speed difference compared to tables on disk is most likely not large. The core paradigm here should be: do not read what is not needed.
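
To illustrate the sampling advice, here is a hedged PySpark sketch (data and column names are made up) that estimates an aggregate from roughly 1% of the rows instead of scanning the full table:

```python
# Toy sketch: estimating an average from a sample instead of a full scan.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sampling-demo").getOrCreate()
sales = spark.range(0, 1_000_000).selectExpr("id", "rand() * 100 AS amount")

exact_avg = sales.agg(F.avg("amount")).first()[0]  # touches every row

approx_avg = (sales.sample(fraction=0.01, seed=42)  # touches ~1% of the rows
                   .agg(F.avg("amount"))
                   .first()[0])

print(exact_avg, approx_avg)  # usually very close for well-behaved data
```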

In short: be aware of your priorities when leveraging speed-ups through in-memory technology. Not everything has to be in-memory.

Nevertheless, you should first leverage all possible optimizations that do not require in-memory technology. An inefficient data structure on disk does not become a better structure just because it is held in-memory. Additionally, you should think about how much data you really need and how precise your results need to be. As I wrote in a previous blog post, this can save you a lot of time that you can use to perform further analytic tasks.

In the following, I describe some challenges that you need to tackle to be successful with in-memory technologies.

Challenges with in-memory

Memory fragmentation

Problem

Memory fragmentation does not only occur with in-memory technologies, but with any kind of storage. You can have internal fragmentation, where you allocate more memory to an application than needed, or external fragmentation, where you deallocate memory, but new data does not fit into the deallocated space and you have to use additional memory.

However, it can be rather problematic with in-memory technologies, because main memory is usually the smallest storage available. In the context of Big Data, where a huge diversity of data sets grows over time and different algorithms use memory in different ways, fragmentation becomes apparent much more quickly than if there were just one static data set that never changes and is always processed the same way.

The issue with memory fragmentation is that you effectively have less memory than is physically available – potentially a lot less. This leads to unexpected performance degradation and to spilling over to slower disk space to continue the computation, which may lead to thrashing.

You cannot completely avoid memory fragmentation, because one cannot foresee when which data set will be loaded and what computation will be needed.

Solution

A first step to handle memory fragmentation is to keep a list of the processes and interactive queries that are regularly executed and to review them for potential fragmentation issues. This list can also be used during monitoring to stay aware of memory fragmentation. One indicator is that the available memory does not match the memory that should be consumed. Another indicator is a lot of spills to disk.

There are several strategies to handle identified memory fragmentation. In the case of in-memory batch processes, one should release all memory after the batch process has been executed. Furthermore, one should use distributed Big Data technologies, which usually work with fixed block sizes from the distributed file system layer (e.g. HDFS). In this case you can partially avoid external fragmentation – only partially, because many algorithms also produce temporary data or temporarily relevant data, which needs to be taken into account as well.

If you have interactive analytics, a very common recommendation, even by vendors of popular memcache solutions, is to restart the cache from time to time, thereby forcing the data to be reloaded into the cache in an ordered manner and avoiding fragmentation. Of course, once you add, modify or remove data, you again get some fragmentation, which will grow over time.

Another, similar approach is called compaction, which exists in traditional relational databases and big data systems. Compaction reduces fragmentation that occurs due to updates, deletions and insertions of new data. The key here is that you can gain performance benefits for your users if you schedule it at times when the system is not used. Surprisingly, people often do not look at compaction, although it has a significant impact on performance and memory usage. Instead they rely on non-optimal default settings, which are usually tuned not for large-scale analytics, but for smaller-scale OLTP usage. For instance, for large-scale analytics it can make sense to schedule compaction after all the data has been loaded and no new data is arriving before the next execution of a batch process.

What data should be in-memory? About temperature of data…

The Concept

It is not realistic to have all your data in-memory. This is not only due to memory fragmentation, but also due to the cost of memory, fault tolerance, backups and many other factors. Hence, you need an approach to decide which data should be in-memory.

As described before, it is important to know your data priorities. Quite often these priorities change, e.g. new data is introduced, or data simply becomes outdated. Usually it is reasonable to expect that data that is several months or years old will not be touched often, except for research purposes. This is where the temperature of data, i.e. hot, warm and cold data, comes into play.

Hot data has been used quite frequently recently and is likely to be used quite frequently in the near future.

Warm data has been used recently, but not as frequently as hot data, and it is not likely to be used frequently in the near future.

Cold data has not been used recently and is not likely to be used in the near future.

Usually hot data resides in CPU caches and mainly in main memory. Warm data resides mainly on local disk drives, with only a small fraction in main memory. Cold data resides mostly on external slow storage, potentially accessed via the network or in the cloud.

Managing Temperature of Data

The concept of data temperature applies equally to batch processes and interactive analytics. However, you need to think about which data needs to be kept hot, warm and cold; ideally this happens automatically. For example, many common in-memory systems provide the LRU (least recently used) strategy to automatically move hot data to warm data and eventually to cold data, and the other way around. For instance, Memcached or SAP HANA support this as a default strategy.

This seems to be a good default strategy, especially if you cannot or do not want to look in more detail into the temperature of your data. It also rests on sound assumptions, since it is based on the principle of locality, which is also key to distributed Big Data processes and many machine learning algorithms.

However, there are alternative strategies to LRU that you may want to think about (a small sketch of such eviction policies follows the list):

  • Most recently used (MRU): The most recently used memory element is moved to warm and eventually to cold storage. This assumes that stable, older data is more relevant than the newest data.
  • Least frequently used (LFU): The data which has been least frequently used is moved to warm storage and eventually to cold storage. The advantage here is that recently used data that has been accessed only once is moved out quickly, while data that has been accessed quite frequently, but not in the recent past, stays in-memory.
  • Most frequently used (MFU): The data which has been most frequently used in the past is moved to warm storage and eventually to cold storage. The idea here is that the more often data has been used, the less valuable it will be and hence the less it will be accessed in the near future.
  • Any combination of the above
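
To make these policies concrete, here is a minimal Python sketch of an LRU eviction policy (an illustration of the idea, not a production cache); the alternatives above differ only in which entry they pick as the eviction victim:

```python
from collections import OrderedDict

class LRUCache:
    """Toy LRU cache: evicts the least recently used entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)          # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            # LFU/MFU/MRU would pick a different victim here.
            self.data.popitem(last=False)   # evict least recently used

cache = LRUCache(capacity=2)
cache.put("a", 1); cache.put("b", 2); cache.put("c", 3)
print(cache.get("a"))  # None – "a" was evicted as least recently used
```

An LFU variant would additionally track access counts and evict the entry with the smallest count instead.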

Obviously, the perfect strategy would predict which data will be used least in the future (“clairvoyant algorithms”) and move data to hot, warm and cold storage accordingly. This is of course not exactly possible, but a sound understanding of how people use your Big Data platform can come pretty close to that ideal.

Of course, you can also implement more sophisticated machine learning algorithms that take the environment into account to predict what data and computation will be required in the future, given a new task (cf. an approach for scheduling multimedia tasks in the cloud based on machine learning algorithms – the use case is different, but the general idea is the same). Unfortunately, most of the popular Big Data and in-memory solutions do not implement such an approach yet.

How should the data be structured?

Many people, including business people, have only the traditional world of tables, consisting of rows and columns, in mind when using data. In fact, a lot of analysis is based on this assumption. However, while tables are simple, they might not be the most efficient way to store data in-memory, or even to process it.

In fact, depending on your analysis different formats make sense, such as:

  • Use the correct data type: If you have numbers, use a data type that supports numbers, such as integer or double. Dates can often be represented as integers. This requires less storage, and the CPU can read an integer represented as an integer orders of magnitude faster than an integer represented as a string. Similarly, even with integers you should select an appropriate size: if your numbers fit into a 32-bit integer, then you should prefer storing them as 32-bit instead of 64-bit. This can increase your performance significantly. The key message here is: store the data with the right data type, use the available types, and understand their advantages as well as their limitations.
  • Column-based: Data is stored in columns instead of rows. This is usually beneficial if you need to access one or more full columns of a given data set. Furthermore, it enables one to avoid reading irrelevant data by using storage indexes (min/max) or bloom filters.
  • Graph-based: Data is stored as so-called adjacency lists or sometimes as adjacency matrices. This performs much better than row-based or column-based storage for graph algorithms, such as strongly connected components, shortest path etc. These algorithms are useful for social network analytics, financial risks of assets sold by different organizations, dependencies between financial assets etc.
  • Tree-based: Data is stored in a tree structure. Trees can usually be searched comparably fast and are often used for database indexes to find out in which data block a row is stored.
  • Search indexes for terms in unstructured text: These are usually useful for matching data objects which are similar but do not have unique identifiers. Alternatively, they can be used for sentiment analysis. Traditional database technology shows, for example, terrible performance for these use cases – even in-memory.
  • Hash-clustering indexes: These can be used in column stores by generating a hash out of the values of several columns for one row. This hash is stored as another column. It can be used to quickly search for several criteria at the same time by using only one column. This reduces the amount of data to be processed at the expense of additional storage (see the sketch after this list).
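
A hedged PySpark sketch of such a hash-clustering column (all table and column names are invented for illustration):

```python
# Toy sketch: one hash column combining several search criteria.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hash-cluster-demo").getOrCreate()
df = spark.createDataFrame(
    [("DE", "2015-06-01", "retail"), ("FR", "2015-06-02", "wholesale")],
    ["country", "order_date", "segment"],
)

# Derive one column from the values of three columns.
df = df.withColumn("hash_cluster", F.hash("country", "order_date", "segment"))

# A multi-criteria lookup now filters on a single column.
df.filter(
    F.col("hash_cluster")
    == F.hash(F.lit("DE"), F.lit("2015-06-01"), F.lit("retail"))
).show()
```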

Furthermore, the same data might be stored in different formats on warm or cold storage, meaning that you have to decide whether you want to keep redundant data or generate the optimal storage layout from scratch each time for a given computation.

Compression can also make sense for data in-memory, because it enables storing more data in memory instead of on slower disk drives.

Unfortunately, contrary to the strategies for managing data temperature, there are currently no mature strategies that automatically decide how to store your data. This is a manual decision and thus requires good knowledge of how your Big Data platform is used.

Do we always need in-memory?

With the need for processing large data sets, some things became apparent: even with new technologies, such as in-memory or Big Data platforms, it is sometimes very inefficient to process data by looking at all of it – it is better not to read the data at all!

Of course, this means you should avoid reading irrelevant data. For instance, it was very common in traditional databases to read all rows to find the ones matching a query. Even when using indexes, some irrelevant data is read when scanning the index, although storing the index as a tree structure already increased search performance a lot.

More modern solutions use storage indexes and/or bloom filters to decide which rows they need to read. This means they can skip blocks of data that contain no rows matching a query (cf. the implementation in Apache Hive).
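
The idea behind a bloom filter fits into a few lines of plain Python (a toy illustration, not the Hive implementation): it may answer “maybe contained” for an absent item, but it never misses an item that was actually added, which is exactly what is needed to skip blocks safely.

```python
import hashlib

class BloomFilter:
    """Toy bloom filter: false positives possible, false negatives not."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

# One filter per data block: if might_contain() is False, skip the block.
block_filter = BloomFilter()
block_filter.add("customer_42")
print(block_filter.might_contain("customer_42"))  # True
print(block_filter.might_contain("customer_99"))  # False (with high probability)
```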

Similarly, probabilistic data structures, such as HyperLogLog, or approaches based on sampling enable one to avoid reading all the data again, or at all. In fact, here you can even skip “relevant” data – as long as you read enough data to provide correct results within a small error margin.
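
Spark, for instance, exposes such a probabilistic structure through `approx_count_distinct` (HyperLogLog-based); a small sketch with toy data:

```python
# Toy sketch: exact vs. HyperLogLog-based distinct count.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hll-demo").getOrCreate()
visits = spark.range(0, 1_000_000).selectExpr(
    "cast(rand() * 50000 AS long) AS user_id"
)

# Exact distinct count: expensive shuffle over all the data.
exact = visits.select(F.countDistinct("user_id")).first()[0]

# HyperLogLog estimate: bounded memory, small relative error (here ~2%).
approx = visits.select(F.approx_count_distinct("user_id", rsd=0.02)).first()[0]

print(exact, approx)
```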

Hence, even with in-memory technologies, it is always better to avoid reading data. Even if the data is already in-memory, the more data the CPU has to process, the more time it needs – a simple but often overlooked fact.

The impact of Scheduling: Pre-emption

Once your Big Data platform or in-memory platform grows, you will not only get more data, but also more users working on it in parallel. If they run interactive queries or schedule Big Data processes, they need to share resources, including memory and CPU – especially when taking speculative execution into account. As described before, you ideally have a general big picture of what will happen, especially with main memory. However, at peak times – and for some Big Data deployments most of the time – the resources are not sufficient, for cost or other reasons.

This means you need to introduce scheduling according to scheduling policies. We briefly touched on this topic before, because the concept of data temperature implies some kind of scheduling. However, if you have a lot of users, the bottleneck is mainly the number of processors that process data. Hence, the analytics of some users are sometimes partially interrupted to free resources for other users. These users may use different data sets, meaning that some data might also be moved from main memory to disk drives. After the interrupted tasks are resumed, they may need to reload data from disk drives into memory.

This can sometimes make the perceived performance unpredictable. You should be aware of it, so that you can react properly to incidents reported by users or do more informed capacity management.

Big Data technologies for in-memory

In-memory batch processing

There are several in-memory batch processing technologies for Big Data platforms, for example Apache Spark or Apache Flink. In the beginning, these platforms, especially Spark, had the drawback of representing everything as Java objects in memory. This meant, for instance, that a 10-character string could consume 6 times more memory than representing it as an array of bytes.

Luckily this has changed: data is now stored in-memory in a columnar fashion, and data on disk that is not relevant can also be skipped (via predicate pushdown and an appropriate disk storage format, such as ORC or Parquet).
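
A hedged PySpark sketch of this (paths and column names are made up): the filter is pushed down into the Parquet scan, so row groups whose min/max statistics exclude the predicate are never read from disk.

```python
# Toy sketch: predicate pushdown with a columnar on-disk format (Parquet).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pushdown-demo").getOrCreate()

# Write a small toy dataset in Parquet.
df = spark.range(0, 100_000).selectExpr("id", "id % 100 AS store_id")
df.write.mode("overwrite").parquet("/tmp/sales_demo")

# The filter on store_id is pushed into the scan; irrelevant row groups
# are skipped based on their min/max statistics.
filtered = spark.read.parquet("/tmp/sales_demo").filter(F.col("store_id") == 7)
filtered.explain()  # look for "PushedFilters" in the physical plan
```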

Additionally, both support graph batch processing and processing of continuous streams in-memory. However, both rely on a common abstraction for a data structure in which they represent other data structures, such as graphs; in the case of Spark these are Resilient Distributed Datasets (RDDs) / DataFrames. This means they do not reach the performance of a highly specialized graph engine, but they are more generic, and it is possible to integrate them with other data structures. For most current use cases this is sufficient.

Additionally, different processing algorithms, mainly in the area of machine learning, are supported.

Sometimes you will see that they are also advertised as interactive platforms. However, this is not their core strength, because they do not, for example, support the concept of data temperature automatically, i.e. the developer is fully responsible for taking hot, warm and cold data into account or for implementing a strategy as described above. Additionally, they do not provide index support for data in-memory, because this is usually much less relevant for batch processes. Hence, if you want to use these technologies for interactive analysis, you have to develop some common IT components and strategies for addressing the temperature of data and the “do not read irrelevant data” paradigm.

In any case, you have to think about scheduling strategies to optimize the resource usage of your available infrastructure.

Depending on your requirements, in-memory batch processing is not needed in all cases; to be efficient, your Big Data platform should support both in-memory and non-in-memory batch processes. In particular, if your batch process loads and processes the data only once, without re-reading parts of it, you won’t benefit much from in-memory.

Interactive in-memory analytics

There are several technologies enabling interactive in-memory analytics. One of the older – but still highly relevant – ones is memcached for databases. Its original use case was to speed up web applications accessing the database with many users writing and reading the same data in parallel. Similar technologies are also used for Master Data Management (MDM) systems, because they need to deliver data to, and receive data from, a lot of sources, different systems and business processes with many users. This would be difficult relying on databases alone.

Other technologies focus on the emerging Big Data platforms based on Hadoop, but also augment in-memory batch processing engines, such as Spark. For instance, Apache Ignite provides functionality similar to memcached, but also supports Big Data platforms and in-memory batch processing engines. For example, you can create shared RDDs for Spark, cache Hive tables or partitions in-memory, or use the Ignite DataGrid to cache selected queries in-memory.
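
For illustration, a minimal sketch using Ignite’s Python thin client `pyignite` (assuming an Ignite node running locally on the default thin-client port; cache name and data are invented):

```python
from pyignite import Client

# Connect to a running Ignite node (default thin-client port 10800).
client = Client()
client.connect("127.0.0.1", 10800)

# A distributed in-memory key-value cache shared across applications.
cache = client.get_or_create_cache("hot_customer_profiles")
cache.put("customer_42", {"segment": "gold", "last_order": "2016-01-15"})
print(cache.get("customer_42"))

client.close()
```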

These technologies support advanced in-memory indexes (keep in mind: it is always better not to read data!) and automated data temperature management. Another example is Apache Tachyon.

There are also very specialized interactive in-memory analytics engines, such as TitanDB for graphs. TitanDB is based on the Tinkerpop graph stack including the interactive Graph query (or graph traversal) language Gremlin. SAP HANA is a specific in-memory column database for OLTP, OLAP, text-analytics and graph applications. It has been extended to a full application stack cloud platform based on in-memory technology.

Taking scheduling into account is much more tricky with interactive analytics, because one does not know exactly what the users will do, and prediction engines for user behavior in interactive analytics are currently nearly non-existent.

However, you can define different categories of interactive analytics (e.g. simple queries, complex queries, machine learning, graphs, …) and determine your infrastructure as well as its configuration based on these categories.

Conclusions

It makes sense to distinguish between in-memory batch processes and interactive in-memory analytics. In-memory batch processes can be planned and scheduled in advance more easily, and one can better optimize resources for them. They are also more focused on processing all the data. Specific technologies for distributed in-memory Big Data exist and are complementary to technologies for interactive in-memory analytics; the main differences are additional indexes and automated support for the concept of data temperature.

Even for in-memory technology, the key concept of Big Data – do not read data that is not relevant – is of high importance. Processing terabytes of data in-memory when only a subset is relevant is a waste of resources and, particularly, time. This is especially difficult to handle for interactive in-memory analytics, where users can do whatever they want. Hence, automated and intelligent mechanisms to support this are highly desirable, and they should be preferred over manually developing the right data model and structures.

Another key concept is to have the right data structure in-memory for optimal processing. For instance, graph structures perform much better for graph algorithms than relational row- or column-based structures, which would need to be joined very often. Furthermore, probabilistic data structures and probabilistic sampling queries are of growing importance. Depending on your needs, you might have the same data represented redundantly in different data structures for different analysis purposes.

Finally, distinguishing interactive analytics and batch processing is not always straightforward. For instance, you can have a batch process running 5 minutes whose results are queried 1000 times, so avoiding the 5-minute run time each time can be very beneficial. On the other hand, you can have an interactive query by one user that takes 60 minutes but is needed only once. This may also change over time, so it is important that, even after a solution has been developed, business and technical users monitor and audit it regularly to check whether the current approach still makes sense or another approach would make more sense. This requires a regular dialogue even after the go-live of a Big Data application.

The convergence of both concepts requires more predictive algorithms for managing data in-memory and for queries. These are only in their first stages, but I expect much more over the coming years.

Scenarios for Inter-Cloud Enterprise Architecture

The unstoppable cloud trend has reached end users and companies. Particularly the former openly embrace the cloud, for instance by using services provided by Google or Facebook. The latter are more cautious, fearing vendor lock-in or exposure of secret business data, such as customer records. Nevertheless, for many scenarios the risk can be managed and is accepted by companies, because the benefits, such as scalability, new business models and cost savings, outweigh the risks. In this blog entry, I will investigate in more detail the opportunities and challenges of inter-cloud enterprise applications. Finally, we will have a look at technology supporting inter-cloud enterprise applications via cloud-bursting, i.e. enabling them to be extended dynamically over several cloud platforms.

What is an inter-cloud enterprise application?

Cloud computing encompasses all means to produce and consume computing resources, such as processing units, networks and storage, existing in your company (on-premise) or on the Internet. Particularly the latter enables dynamic scaling of your enterprise applications, e.g. when you suddenly get a lot of new customers but do not have the necessary capacity to serve them all using your own computing resources.

Cloud computing comes in different flavors and combinations of them:

  • Infrastructure-as-a-Service (IaaS): Provides hardware and basic software infrastructure on which an enterprise application can be deployed and executed. It offers computing, storage and network resources. Examples: Amazon EC2 or Google Compute.
  • Platform-as-a-Service (PaaS): Provides, on top of an IaaS, a predefined development environment, such as Java, ABAP or PHP, with various additional services (e.g. database, analytics or authentication). Examples: Google App Engine or Agito BPM PaaS.
  • Software-as-a-Service (SaaS): Provides, on top of an IaaS or PaaS, a specific application over the Internet, such as a CRM application. Examples: SalesForce.com or Netsuite.com.

When designing and implementing/buying your enterprise application, e.g. a customer relationship management (CRM) system, you need to decide where to put it in the cloud. For instance, you can put it fully on-premise or on a cloud in the Internet. However, different cloud vendors exist, such as Amazon, Microsoft, Google or Rackspace, and they offer different flavors of cloud computing. Depending on the design of your CRM, you can put it on an IaaS, PaaS or SaaS cloud, or on a mixture of them. Furthermore, you may put only selected modules of the CRM on the cloud in the Internet, e.g. a module for doing anonymized customer analytics. You will also need to think about how this CRM system is integrated with your other enterprise applications.

Inter-Cloud Scenario and Challenges

Basically, the exemplary CRM application runs partially in the private cloud and partially in different public clouds. The CRM database is stored in the private cloud (IaaS), and some (anonymized) data is sent to different public clouds on Amazon EC2 (IaaS) and Microsoft Azure (IaaS) for some number-crunching analysis. Paypal.com is used for payment processing. Besides customer data and buying history, the database contains sensor information from different points of sale, such as how long a customer was standing in front of an advertisement. Additionally, the sensor data can be used to trigger actuators, such as posting on the shop’s Facebook page what is currently trending, using the cloud service IFTTT. Furthermore, the graphical user interface presenting the analysis is hosted on Google App Engine (PaaS). The CRM is integrated with Facebook and Twitter to enhance the data with social network analysis. This is not an unrealistic scenario: many (grown) startups already deploy a similar setting, and established corporations experiment with it. Clearly, this scenario supports cloud-bursting, because the cloud is used heavily.

I present in the next figure the aforementioned scenario of an inter-cloud enterprise application leveraging various cloud providers.

[Figure: inter-cloud enterprise application leveraging various cloud providers]

There are several challenges involved when you distribute your business application over your private and several public clouds.

  • API Management: How do you describe different types of business and cloud resources, so that you can make efficient and cost-effective decisions about where to run the analytics at a given point in time? Furthermore, how do you represent different storage capabilities (e.g. in-memory, on-disk) in different clouds? This goes further, up to the level of the business application, where you need to harmonize or standardize business concepts, such as “customer” or “product”. For instance, a customer described in “Twitter” terms is different from a customer described in “Facebook” or “Salesforce.com” terms. You should also keep in mind that semantic definitions change over time, because a cloud provider changes its capabilities, such as new computing resources, or its focus. Additionally, you may want to dynamically change your cloud provider without disrupting the operation of the enterprise application.
  • Privacy, risk and security: How do you articulate your privacy, risk and security concerns? How do you enforce them? While technology and standards for this already exist, the cloud setting imposes new problems. For example, if you update encrypted data regularly, the cloud provider may be able to reconstruct parts or all of your data from the differences. Furthermore, it may maliciously change it. Finally, the market is fragmented, without an integrated solution.
  • Social network challenge: Similarly to the semantic challenge, there is the problem of semantically describing social data and doing efficient analysis over several different social networks. Users may also arbitrarily change their privacy preferences, making reliable analytics difficult. Additionally, your whole company organizational structure and the (in-)official networks within your company are already exposed in social business networks, such as LinkedIn or Xing. This further blurs the borders of your enterprise, to which it has to adapt by integrating social networks into its business applications. For instance, your organizational hierarchy, informal networks or your company’s address book probably already exist partly in social networks.
  • Internet of Things: The Internet of Things consists of sensors and actuators delivering data or executing actions in the real world, supported by your business applications and processes. Different platforms exist to source real-world data or to schedule actions in the real world using actuators. The API management challenge exists here as well, but it goes even further: you create dynamic semantic concepts and relate your Internet of Things data to them. For example, you have attached an RFID tag and a temperature sensor to your parcels. Their data needs to be added to the information about your parcel in the ERP system. Besides the semantic concept “parcel” you also have that of a “truck” transporting your “parcel” to a destination, i.e. you have additional location information. Furthermore, it may be stored temporarily in a “warehouse”. Different business applications and processes may need to know where the parcel is. They do not query the sensor data (e.g. “give me data from tempsen084nl_98484”), but rather formulate a query such as “list all parcels in warehouses with a temperature above 0 °C” or “list all parcels in transit”. Hence, Internet of Things data needs to be dynamically linked with business concepts used in different clouds (a toy sketch of this linking follows this list). This is particularly challenging for SaaS applications, which may have different conceptualizations of the same thing.
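
A toy Python sketch of this linking step (all identifiers are invented, reusing the sensor ID from the example above): raw sensor readings are mapped onto the business concept “parcel” so that queries can be formulated on the business level.

```python
# Hypothetical mapping from raw sensor IDs to business objects.
sensor_to_parcel = {
    "tempsen084nl_98484": {"parcel_id": "P-1001", "location": "warehouse-NL-3"},
    "tempsen102de_11223": {"parcel_id": "P-2002", "location": "in-transit"},
}

readings = [  # raw IoT data as delivered by the sensor platform
    {"sensor": "tempsen084nl_98484", "temperature_c": 4.0},
    {"sensor": "tempsen102de_11223", "temperature_c": -2.5},
]

def parcels_in_warehouses_above(readings, threshold_c):
    """Business-level query: parcels in warehouses above a temperature."""
    for r in readings:
        business = sensor_to_parcel.get(r["sensor"])
        if (business
                and business["location"].startswith("warehouse")
                and r["temperature_c"] > threshold_c):
            yield business["parcel_id"]

print(list(parcels_in_warehouses_above(readings, 0)))  # ['P-1001']
```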

Enterprise Architecture for Inter-Cloud Applications

You may wonder how you can integrate the above scenario into your application landscape at all, and why you should do so. The basic promise of cloud computing is that it scales according to your needs and that you can outsource infrastructure to people who have the knowledge and capabilities to run it. Particularly small and medium-sized enterprises benefit from this and from the cost advantage. It is not uncommon for modern startups to start their IT in the cloud (e.g. FourSquare).

However, also large corporations can benefit from the cloud, e.g. as a “neutral” ground for a complex supply chain with a lot of partners or to ramp up new innovative business models where the outcome is uncertain.

Be aware that in order to offer a solution based on the cloud, you first need a solid maturity of your enterprise architecture. Without it you are doomed to fail, because you cannot do proper risk and security analyses or scaling, nor benefit from cost reductions and innovation.

I propose in the following figure an updated model of the enterprise architecture with new components for managing cloud-based applications. The underlying assumption is that you have an enterprise architecture, more particularly a semantic model of business objects and concepts.

[Figure: updated enterprise architecture model with new components for managing cloud-based applications]

  • Public/Private Border Gateway: This gateway is responsible for managing the transition between your private cloud and different public clouds. It may also deploy agents on each cloud to enable secure direct communication between different cloud platforms without the necessity to go through your own infrastructure. You might have more fine-granular gateways, such as private, closest supplier and public. A similar idea came to me a few years ago when I was working on inter-organizational crisis response information systems. The gateway does not only work on the lower network level, but also on the level of business processes and objects. It is business-driven and, depending on business processes as well as rules, it decides dynamically where the borders should be set. This may also mean that different business processes have access to different things in the Internet of Things.
  • Semantic Matcher: The semantic matcher is responsible for translating business concepts from and to different technical representations of business objects in different cloud platforms. This can involve simple transformations of non-matching data types, but also enrichment of business objects from different sources. This goes well beyond current technical standards, such as EDI or ebXML, which I see as a starting point. Semantic matching is done automatically – there is no need for creating time-consuming manual mappings. Furthermore, the semantic matcher enhances business objects with Internet of Things information, so that business applications can query or trigger them on the business level, as described before. The question here is how you can keep people in control of this (see Monitor) and leverage semantic information.
  • API Manager: Cloud API management is the topic of the coming years. Besides the semantic challenge, this component provides all necessary functionality to bill, secure and publish your APIs. It keeps track of who is using your API and of what impact changes to it may have. Furthermore, it supports you in composing new business software distributed over several cloud platforms using different APIs subject to continuous change. The API Manager will also have a registry of APIs with reputation and quality-of-service measures. We now see a huge variety of different APIs by different service providers (cf. ProgrammableWeb). However, the scientific community and companies have not yet picked up the inherent challenges, such as the aforementioned semantic matching, monitoring of APIs, API change management and alternative API compositions. While some work exists in the web service community, it has not yet been extended to the full Internet dimension as described in the scenario here. Additionally, it is unclear how they integrate the Internet of Things paradigm.
  • Monitor: Monitoring is of key importance in this inter-cloud setting. Different cloud platforms offer different, and possibly very limited, means for monitoring. A key challenge here will be to consolidate the monitoring data and provide an adequate visual representation for doing risk analysis and selecting alternative deployment strategies on the aggregated business process level. For instance, by leveraging semantic integration we can schedule requests to semantically similar cloud and business resources. Particularly in the Internet of Things setting, we may observe unpredictable delays, which lead to delayed execution of real-world activities, e.g. a robot is notified only after 15 minutes that a parcel fell off the shelf.

Developing and Managing Inter-Cloud Business Applications

Based on your enterprise architecture, you should ideally employ a model-driven engineering approach, which enables you to automate the software development process. Be aware that this is not easy to do and has often failed in practice – however, I have also seen successful approaches. It is important that you select the right modeling languages, and you may need to implement your own translation tools.

Once you have all this infrastructure, you should think about software factories, which are ideal for developing and deploying standardized services for selected platforms. I imagine that in the future we will see small emerging software factories focusing on specific aspects of a cloud platform. For example, you will have a software factory for designing graphical user interfaces using map applications enhanced with selected OData services (e.g. warehouse or plant locations). In fact, I soon expect a market for software factories which extends the idea of very basic crowdsourcing platforms, such as Amazon Mechanical Turk.

Of course, as more and more business applications shift towards private and public clouds, you will introduce new roles in your company, such as the Chief Cloud Officer (CCO). This role is responsible for managing the cloud suppliers, integrating them into your enterprise architecture, and proper controlling as well as risk management.

Technology

The cloud already exists today, and more and more tools emerge to manage it. However, they do not take the complete picture into account. I described several components for which no technologies exist. However, some go in the right direction, as I will briefly outline.

First of all, you need technology to manage your APIs in order to provide a single point of management towards your cloud applications. For instance, Apache Deltacloud allows managing different IaaS providers, such as Amazon EC2, IBM SmartCloud or OpenStack.

IBM Research also provides a single point of management API for cloud storage. This goes beyond simple storage and enables fault tolerance and security.

Other providers, such as Software AG, Tibco, IBM or Oracle, provide “API Management” software, which covers only a special case of API management. In fact, they provide software to publish, manage the lifecycle of, monitor, secure and bill your own APIs for the public on the web. Unfortunately, they do not describe the necessary business processes to enable their technology in your company. Besides that, they do not support B2B interaction very well, but focus on business-to-developer aspects only. Additionally, you find registries for public web APIs, such as ProgrammableWeb or APIHub, which are a first starting point for finding APIs. Unfortunately, they do not feature semantic descriptions and thus no semantic matching towards your business objects, which means a lot of laborious manual work for matching them to your application.

There is not much software for managing the borders between private and public clouds, or even for allowing more fine-granular borders, such as private, closest partner and public. There is software for visualizing and monitoring these borders, such as the eCloudManager by Fluid Operations, which features semantic integration of different cloud resources. However, it is unclear how you can enforce these borders, how you control them and how you can manage different borders. Dome 9 goes in this direction, but focuses only on security policies for IaaS applications. It only understands data and low-level security, not security and privacy over business objects. Deployment configuration software, such as Puppet or Chef, is only a first step, since it focuses on deployment, but not on operation.

On the monitoring side you will find a lot of software, such as Apache Flume or Tibco HAWK. While these operate more on the lower level of software development, IFTTT enables the execution of business rules over data on several cloud providers offering public APIs. Surprisingly, it considers itself at the moment more as an end-user-facing company. Additionally, you find approaches for monitoring distributed business processes in the academic community.

Unfortunately, we find little ready-to-use software in the area of the Internet of Things. I myself have worked with several R&D prototypes enabling cloud and gateways, but they are not ready for the market. Products have emerged, but only for special niches, e.g. Internet-of-Things-enabled point-of-sale shops. They particularly lack a vision of how they can be used in an enterprise-wide application landscape or within a B2B enterprise architecture.

Conclusion

I described in this blog the challenges of inter-cloud business applications. I think that in the near future (3-5 years) all organizations will have some of them. Technically, they are already possible and exist to some extent. The risk and costs will be lower for many companies than managing everything on their own. Nevertheless, a key requirement is that you have a working enterprise architecture management strategy; without it you won’t see any benefits. More particularly, from the business side you will need adequate governance strategies for different clouds and APIs.

We have already seen key technologies emerging, but there is still a lot to do. Despite decades of research on semantic technologies, there exists today no software that can perform automated semantic matching of cloud and business concepts existing in different components of an inter-cloud business application. Furthermore, there are no criteria for selecting a semantic description language for business purposes as broad as described here. Enterprise architecture management tools in this area are only slowly emerging. Monitoring is still fragmented, with many low-level tools but only few high-level business monitoring tools. They cannot answer simple questions, such as “if cloud provider A goes down, how fast can I recover my operations and what are the limitations?”. API management is another evolving area which will have a significant impact in the coming years. However, current tools only consider low-level technical aspects, not high-level business concepts.

Finally, you see that a lot of the challenges mentioned in the beginning, such as the social network challenge or the Internet of Things challenge, are simply not yet solved, but large-scale research efforts are under way. This means further investigation is needed to clarify the relationships between the aforementioned components. Unfortunately, many of the established middleware vendors lack a clear vision for cloud computing and the Internet of Things. Hence, I expect this gap to be filled by startups in this area.

The Future of 3D Printing

3D printing has become a hot topic over the last years, although its roots go back 30 years. Analysts say it will become a billion-dollar industry in the near future. This blog post investigates what exactly is new and what we can envision for the future. I start with an overview of 3D printing. Then I describe what companies exist and what can be printed. Afterwards, I continue with an analysis of the impact on the manufacturing supply chain and the differentiation from other supply chain management concepts. Finally, I conclude with a future vision.

What is it about?

As the name indicates, 3D printing is about printing 3D (width, height, length) objects made out of solid material. It is also called additive manufacturing, because a 3D object is created by adding layers on top of each other. This layer structure can be designed digitally using CAD (Computer Aided Design) software or by using 3D models derived from real objects with 3D scanners. Different materials can be used within the layers, but at the moment they cannot be combined arbitrarily. Nevertheless, a wide range of objects has been created using 3D printing technology. At the moment, it is used mostly for design prototypes, but also for implants. However, it is an open question whether 3D printing is less costly than other types of manufacturing.

3D printing is a young market with only a few existing companies and open source/crowdsourcing organizations. It may be an opportunity for 2D printing companies to leverage existing intellectual property to extend their business to new markets beyond the shrinking 2D printing market. However, new legal problems are likely to appear in these new markets.

Other business opportunities can be seen around the whole 3D printing lifecycle from mining of raw material via printing of 3D objects to recycling as well as reuse of 3D objects.

What companies exist?

There is not one big company offering 3D printers or printing services. We find several smaller ones, but among those the most popular ones are probably:

  • Stratasys
    • Services: They offer various consulting services to optimize the use of their technology in different domains. Furthermore, you can send them digital 3D models and let them print the 3D objects.
    • Products: Mostly printers for prototypes. There seems to be no mass production facility for end user products. Besides printers, they offer 3D production systems, which create stronger products, i.e. they are more stable. Additionally, they can utilize more material as input for the printing process.
  • 3D Systems

We observe that they offer services similar to the 2D printing industry, for instance on-demand printing. This is useful if you (1) do not have a 3D printer, (2) need to print something using different materials, or (3) run out of material for printing.

Other services seem to be more unique to the 3D printing industry, such as the market place for 3D models.

However, there have also been some open source/crowdsourcing initiatives for 3D printers or 3D objects. This includes printers that can print themselves or print upgrades to their own hardware. The popular open source printers by RepRap start from $500. There are also open source market places for 3D objects, such as Thingiverse.

What materials can be used?

In the beginning, most 3D printers were able to print objects made out of plastic, and this is still mostly the case today. This material is especially useful because it can be used in a wide range of products.

Other materials include, but are not limited to:

  • Metal
  • Copper
  • Steel
  • Silver-filled polymers
  • Organic material

Obviously, the different materials require different printing technologies. Furthermore, organic material needs to be cultured after the printing process. Hence, it is currently debated in science and practice what should be labeled as 3D printing and what not. As far as I know, the aforementioned 3D printing companies are not able to print objects consisting of organic material.

What can be printed?

The sky is the limit. However, you need to take into account which quality parameters you require, for example what resolution you need and how stable the object should be. Another thing is that it is still tricky to print complex electronic parts or to assemble printed parts afterwards into another complex object (e.g. a smartphone). Apparently, using human labor is still less costly and yields better quality. Nevertheless, you should not expect that in the future you will just hit the print button and print your own Airbus A380. Complex objects will always need to be modular, and parts of them may need to be replaced for maintenance reasons. This type of work may be done by robots in the future.

Examples of what can be printed and what has drawn attention in the press:

How does it impact the supply chain?

Until now, I have just described one part of the supply chain – the production process. However, in order to create new business opportunities, one must understand the whole supply chain, which at the moment is not as well understood as the production process. Generally, one can distinguish the following phases (cf. also SCOR):

  • Sourcing: This is probably the process most similar to existing sourcing processes.
  • Making: This will be done by the 3D printers. You may design the 3D objects using the tools and technologies mentioned in the beginning.
  • Delivering: Here, we may want to think about new business models. Should the customer print, for example, his new iPhone at home? Should there be a 3D printing shop that has all the material available and that is able to finance printers to make industry-grade products?
  • Recycling: This is somewhat similar to existing recycling processes. However, there needs to be a big incentive for customers and manufacturers to recycle. The open questions are: Can we design components, such as smartphones, so that they can be printed for recycling? Do we need new technologies to separate the end product into its source materials again?
  • Upgrading: This is a new process and does not exist as such in the hardware industry, but it is well known in the software industry. The idea is that instead of throwing away or recycling your old hardware, you just print an upgrade! This is already possible today with certain products (cf. RepRap). Imagine you have a smartphone, e.g. the Samsung Galaxy S2, and want to upgrade it to a new version, e.g. the Samsung Galaxy S3. What if, instead of recycling it, you go to your Samsung store and they print you the upgrade? Can we design systems that are able to do this? What are the limitations? How can we design for a managed evolution of systems?

Future Vision

Currently, I have not seen many big 2D printing companies going into the 3D market. An exception is HP, which uses the technology from Stratasys. Some may think this an obvious step, because they could potentially reuse some of their intellectual property and it could help them find new markets besides the struggling 2D printing market. However, although 3D printing is already more than 20 years old (cf. for example the patents by Stratasys), it has not yet become as important as 2D printing (was). I think we need to develop new business models (cf. previous section) to make it more successful and to make alternative models less attractive (e.g. cheap human labor in developing countries).

Thinking further, we also need to consider threats to 3D printing. At the moment, it is mostly used for prototyping. I wonder if we could not just use holographic technologies together with simulation instead: we would not even need to waste material and could quickly change the prototype. The demonstration capabilities may become similar to real 3D objects in the future, which could also mean higher productivity in comparison to 3D printing. Obviously, I do not expect to fly in a holographic plane or drive a holographic car in the next decades. Nevertheless, 3D printing has not been used for these use cases beyond prototyping, i.e. it has not yet proven useful for mass production.

From a research point of view, we should investigate how we can design components that can be upgraded using 3D printing systems. For instance, how can we avoid being constrained too much by the initial design of the system, while staying flexible enough to allow upgrades that have commercial success in a few years? How can we manage the evolution of systems that are able to replicate themselves? When do we need to decide to recycle the whole system and start from scratch again? We may find some answers in the software industry, but I do not think they will be sufficient.

Last, but not least, we need to consider societal challenges. Production may not be outsourced anymore to developing countries. We will see more demand for highly skilled people designing 3D objects. How do we handle scarce resources, if everybody can print anything he or she wants to have? Finally, as described above, there is the possibility for everyone to print weapons – even using today’s technology. Politicians, researchers and society itself need to find answers to deal with this.

The Emergence of Rule Management Systems for End Users

Long envisioned in research and industry, they have silently emerged: rule management systems for end users to manage their personal (business) processes. End users use these systems to define rules that automate parts of their daily behavior, giving them more time for their life. In this blog entry I will explain new trends and challenges as well as business opportunities.

What is it about?

Our life is full of decisions – complex unique decisions, but also simple recurring ones, such as switching off the energy at home when we leave or sending a message to our family that we have left work. Another more recent example is taking a picture and uploading it automatically to Facebook and Twitter. Making and executing simple decisions takes time that could be used for something more important. Hence, we wish to automate them as much as possible.

The idea is not new. More technically interested people have used computers or home automation systems for several years or even decades. However, these systems required a lot of time to learn, were complex to configure (or required software engineering skills), were proprietary, and it was very uncertain whether the manufacturer would still support them in a few years.

This has changed. Technology is part of everybody’s life, especially since the smartphone and cloud boom. We use smartphone and cloud apps that help us automate our lives, and in the following I will present two apps in this area. These apps can be used by anyone, not only by software engineers.

What products exist?

Ifttt

The first app that I am going to present is a cloud-based app called ifttt (if this then that), which was created by a startup company. You can use a graphical tool to describe rules for integrating various cloud services, such as Facebook, Dropbox, Foursquare, Google Calendar, Stocks or Weather. These cloud services are called channels in ifttt.

A rule has the simple format “if this then that”. The “this” part refers to triggers that start the actions specified by the “that” part. The ifttt application polls the channels every 15 minutes to check whether they have new data and evaluates whether that new data triggers the actions specified in the “that” part.
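To make this evaluation model concrete, here is a minimal sketch of such a polling rule engine in TypeScript. The Channel and Rule types and the startEngine function are my own illustration, not the actual ifttt implementation:

```typescript
// Minimal sketch of an ifttt-style rule engine; the Channel and Rule
// types are hypothetical, not the real ifttt implementation.

interface Channel {
  name: string;
  // Returns items that appeared since the last poll.
  fetchNewItems(): Promise<unknown[]>;
}

interface Rule {
  // "this" part: does a new item trigger the rule?
  trigger: (item: unknown) => boolean;
  // "that" part: action to execute for a triggering item.
  action: (item: unknown) => Promise<void>;
}

const POLL_INTERVAL_MS = 15 * 60 * 1000; // ifttt polls roughly every 15 minutes

async function pollOnce(channels: Channel[], rules: Rule[]): Promise<void> {
  for (const channel of channels) {
    const items = await channel.fetchNewItems();
    for (const item of items) {
      for (const rule of rules) {
        if (rule.trigger(item)) {
          await rule.action(item);
        }
      }
    }
  }
}

// Evaluate all rules against all channels on a fixed polling schedule.
function startEngine(channels: Channel[], rules: Rule[]): void {
  setInterval(() => void pollOnce(channels, rules), POLL_INTERVAL_MS);
}
```

Note that the fixed polling interval is exactly what limits how quickly a rule can react; I will come back to this point in the research section below.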

Defining rules is an easy task, but a user does not necessarily have to do this: a huge community has already created many so-called recipes, which are predefined rules that any other user can reuse without any effort.

On(X)

On(X) is a mobile and cloud application created by Microsoft Research. It leverages the sensor information of your mobile phone for evaluating rules and triggering actions. In contrast to the aforementioned app, it is thus not limited to data from other cloud platforms.

Rules are described using a JavaScript-based language. This means that end users cannot use a graphical tool and need some scripting knowledge. Luckily, the platform also offers some predefined recipes.

We see that these recipes are more sophisticated, because they can leverage the sensor information of the mobile phone (location/geofencing or movement).
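To illustrate what such a script-based, sensor-driven rule could look like, here is a hypothetical sketch in TypeScript. The device API below is purely illustrative and does not reproduce the actual On(X) JavaScript API:

```typescript
// Hypothetical sketch of a sensor-driven, script-based rule in the spirit
// of On(X); the device API below is illustrative, not the real On(X) API.

interface GeofenceEvent { locationName: string; }

interface Device {
  onExitRegion(regionName: string, handler: (e: GeofenceEvent) => void): void;
  sendSms(to: string, text: string): void;
}

declare const device: Device; // assumed to be provided by the platform

// "If I leave work, text my family" as a geofencing rule.
device.onExitRegion("work", (event) => {
  device.sendSms("family", `Just left ${event.locationName}, on my way home.`);
});
```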

What is in it for business?

Using these personal rule engines for business purposes is a largely unexplored topic. I expect that they could lead to more efficiency and to new business models.

Furthermore, we can envision improved service quality by using personal rule engines in various domains:

  • Medicine: if the user is in the kitchen, then remind them to take their medicine, if they have not taken it yet today.
  • Energy saving: if I leave my house, then shut down all energy consumers except the heater (see the sketch after this list).
  • Food delivery: if I am within 10 km of my destination, then start delivering the pizza I have ordered.
  • Car sharing: if I leave work, then send an SMS to all colleagues I share my car with.
  • Team collaboration: we can evaluate whether team members or members of different teams want to perform the same actions or are waiting for similar triggers. They can then be brought together based on their defined rules to combine or split their work more efficiently.
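As an illustration, the energy-saving rule from the list above could look as follows. The home automation API (listAppliances, onLeaveHome) is hypothetical:

```typescript
// Sketch of the energy-saving rule above; the home automation API
// (listAppliances, onLeaveHome) is hypothetical.

interface Appliance { name: string; turnOff(): void; }

declare function listAppliances(): Appliance[];
declare function onLeaveHome(handler: () => void): void;

onLeaveHome(() => {
  for (const appliance of listAppliances()) {
    if (appliance.name !== "heater") {
      appliance.turnOff(); // shut down every consumer except the heater
    }
  }
});
```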

Future research

The aforementioned applications are prototypes. They need to be enhanced, and business models need to be defined for them. First of all, we need to be clear about what we want to achieve by automating simple decisions, e.g.:

    • Cost savings
    • Categorizing information so it can be found and published more quickly
    • Socializing

An important research direction is how we could mine the rules, e.g. for offering advertisements or for bringing people together. Most mining algorithms today focus on mining unstructured or unrelated data, but how can we mine the rules of different users and make sense of them?
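As a first step in this direction, one could compare users by the overlap of their rules. Below is a minimal sketch, assuming rules have already been normalized to simple trigger/action labels; the Jaccard measure is just one plausible choice:

```typescript
// Minimal sketch of rule mining: find pairs of users with similar rules,
// measured as Jaccard similarity of their trigger/action labels.

interface UserRules {
  user: string;
  labels: Set<string>; // normalized trigger/action labels, e.g. "leave:work"
}

function jaccard(a: Set<string>, b: Set<string>): number {
  const intersection = [...a].filter((x) => b.has(x)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 0 : intersection / union;
}

// Return user pairs whose rule sets overlap above a threshold,
// e.g. candidates for car sharing or targeted recommendations.
function similarUsers(all: UserRules[], threshold = 0.5): [string, string][] {
  const pairs: [string, string][] = [];
  for (let i = 0; i < all.length; i++) {
    for (let j = i + 1; j < all.length; j++) {
      if (jaccard(all[i].labels, all[j].labels) >= threshold) {
        pairs.push([all[i].user, all[j].user]);
      }
    }
  }
  return pairs;
}
```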

Another technical problem is the time between rule evaluation and execution. For instance, ifttt polls its data sources only every 15 minutes to check whether actions in a rule should be triggered. This can be too late in time-critical situations or can lead to confusing actions.

From a business point of view, it would be interesting to investigate the integration of personal rule management into Enterprise Resource Planning (ERP) systems as well as to provide a social platform to optimize and share rules.

Finally, I think it is important to think about rules involving or affecting several persons. For example, let us assume that user “A” defined the rule “when I leave work, then inform my car sharing colleagues”. One of the car sharing colleagues has the rule “when a user informs me about car sharing, then inform all others that I do not need a seat in a car anymore”. If user “A” now cannot bring that colleague home, then he or she has a problem.

A simpler example: user “A” defines the rule “if user ‘B’ sends me a message, then send him a message back” and user ‘B’ defines the rule “if user ‘A’ sends me a message, then send him a message back”. This would lead to an endless message exchange between the two users.

Here we need to be able to identify and handle such conflicts.
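Such a message ping-pong is a cycle in the graph that connects the event a rule is triggered by to the event its action produces. Below is a minimal sketch of detecting such cycles before rules are activated, assuming rules can be reduced to consumed/produced event labels:

```typescript
// Sketch of conflict detection: model each rule as an edge from the event
// it is triggered by to the event its action produces, then look for cycles.

interface SimpleRule { triggeredBy: string; produces: string; }

function hasCycle(rules: SimpleRule[]): boolean {
  // adjacency: event -> events that can be produced in response
  const graph = new Map<string, string[]>();
  for (const r of rules) {
    const targets = graph.get(r.triggeredBy) ?? [];
    targets.push(r.produces);
    graph.set(r.triggeredBy, targets);
  }
  const visiting = new Set<string>();
  const done = new Set<string>();
  const dfs = (event: string): boolean => {
    if (visiting.has(event)) return true; // back edge: cycle found
    if (done.has(event)) return false;
    visiting.add(event);
    for (const next of graph.get(event) ?? []) {
      if (dfs(next)) return true;
    }
    visiting.delete(event);
    done.add(event);
    return false;
  };
  return [...graph.keys()].some((event) => dfs(event));
}

// The ping-pong example: A's rule answers B's messages and vice versa.
hasCycle([
  { triggeredBy: "message-from-B", produces: "message-from-A" },
  { triggeredBy: "message-from-A", produces: "message-from-B" },
]); // true -> the two rules would trigger each other forever
```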

Research Challenges for Case Management

I describe in this blog entry research and innovation challenges for case management from a business and technical perspective. This blog entry complements my existing blog entries on case management: introduction & standards, what constitutes a case and system concepts.

This entry is relevant for managers, business analysts, enterprise architects, researchers and developers.

Business Challenges

I have explained before that case management is about dynamically evolving processes. We want to manage them in a structured way, because we expect improved process execution in terms of quality, better cost management and enhanced innovation potential. To leverage these benefits, we need to determine the processes that should be subject to a case management approach, and we need to be able to compare as well as integrate different cases.

I already presented guidelines for modeling cases in a previous blog entry, so that they can deliver a business benefit. However, it is still an open issue how benefits can be realized across the case lifecycle and which cases/processes are suitable for case management. A good starting point is the guide for business processes in [1]. Although it is about managing, deploying and monitoring more standardized business processes, it gives useful hints that are also valid for case management. This includes deployment, monitoring/optimization and learning aspects.

Obviously, some processes are more suitable for case management than others. For example, processes that are always executed similarly and are highly standardized with predictable exceptions should be managed using business process management techniques (e.g. workflow systems or Six Sigma). The issue here is that cases are not necessarily executed similarly, yet we still wish to be able to compare cases, so that we can learn from them and foster innovation; Six Sigma, for example, cannot be transferred directly to case management. Furthermore, this comparability is needed to enable quality and cost management.

Additionally, we may need to think about new roles governing the management of cases. In a previous blog entry, I gave the example of a global business rule designer who ensures consistency of business rules across cases while still allowing for individual case rules.

Nevertheless, it is still important to glue the outputs generated in different cases together to form the big picture of the enterprise and to steer it in the right direction. Case management can offer the right flexibility here to act top-down and bottom-up. However, it is still unclear how different cases can and should be linked, especially on the inter-organizational level, where the involved organizations have different rules, cultures and regulations.

Another interesting aspect is that case management also enables new, innovative models for paying employees. For instance, workers could be paid based on the complexity of a case and the approach they took to solve it; this can be determined by manager and customer together. There is no requirement anymore to stay in the office for a certain time. Case workers can solve cases wherever they are and whenever they want, and they can work as much as they need to satisfy their needs.

Finally, I expect that case management can be a building block of a solution for the management of personal processes, such as founding a company, buying a house or getting married.

Technical Aspects – Case Management Engine

The case management engine should enable users to model, execute and monitor dynamically evolving processes in a structured manner. It supports the user in creating case objects and in defining processes as well as rules for them. The created models can be verified by the engine for correctness, i.e. that the case can be executed without violating any rules or processes (see also [2]).

Rules and processes may be enforced, but for dynamic processes it often makes more sense to detect deviations from rules and processes and to evaluate their impact later (cf. [2]).
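A minimal sketch of this “detect rather than enforce” idea, assuming case rules can be expressed as predicates over the case’s event history (the types are illustrative, not the API of a concrete engine):

```typescript
// Sketch of deviation detection: rules are checked against the case's
// event history and violations are recorded instead of blocking work.

interface CaseEvent { activity: string; timestamp: number; actor: string; }

interface CaseRule {
  id: string;
  description: string;
  // Returns true if the history so far satisfies the rule.
  holds: (history: CaseEvent[]) => boolean;
}

interface Deviation { ruleId: string; detectedAt: number; }

function detectDeviations(history: CaseEvent[], rules: CaseRule[]): Deviation[] {
  return rules
    .filter((rule) => !rule.holds(history))
    .map((rule) => ({ ruleId: rule.id, detectedAt: Date.now() }));
}

// Example rule: an invoice must not be paid before it has been approved.
const approvalBeforePayment: CaseRule = {
  id: "approve-before-pay",
  description: "Payment requires prior approval",
  holds: (history) => {
    const pay = history.findIndex((e) => e.activity === "pay-invoice");
    const approve = history.findIndex((e) => e.activity === "approve-invoice");
    return pay === -1 || (approve !== -1 && approve < pay);
  },
};
```

The point of recording deviations rather than blocking the case worker is that a deviation may turn out to be a legitimate, innovative way of handling the case, which can be assessed afterwards.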

While a lot of rule and process formalisms are known (cf. [2] for an overview), it is still an open challenge which one should be used and which one makes sense for a given case. Here, we need to evaluate and compare case management solutions in a real company setting.

Technical Aspects – Graphical User Interface/Visualizations/Pervasive Interactions

Another question is how case workers and their enterprises can get the maximum benefit from case management. By following a structured approach, we expect that case workers can make more sense of cases and dynamically evolving processes, so that they can react better to a given environment. This requires new visualization techniques showing the case evolution, so that it is clear what has been done, what is currently going on and what the next steps are.

However, recent developments show that the workforce does not want and does not need to sit in an office all day long. Case workers need to be at the customer site, do sports, stay with their families or simply want to enjoy the world. This means that we have to support their contribution to cases wherever they are. Novel solutions need to be designed so that they can provide their input to a dynamic case process at the right time, at the right place and using any device (e.g. screens, walls, voice or gestures). This also includes proactively recommending case objects to the case worker (e.g. based on expertise, skills or previous case executions). Appropriate recommendation algorithms that also consider long-term aspects need to be invented.

Obviously, we need to integrate our existing collaboration and communication components (e.g. voice chat, collaborative text editing or version control management systems). I observe a lot of new development related to novel Web standards supporting this (e.g. OpenSocial or Web Intents). Nevertheless, there is still some research needed on how we can leverage these emerging standards.

Technical Aspects – Inter-organizational Distributed Level

I think the real challenge of case management is to support cross-organizational cases and dynamic processes. Business Process Management has failed badly in this area, not only technically but also from a business perspective. However, if we manage to get it right, we can expect a lot of benefits from it.

From a technical perspective, we need to consider that organizations working on one case cannot and do not have a complete overview on the case due to privacy, regulatory or strategic reasons. Furthermore, an inter-organizational case has to be embedded in the different environments (e.g. business goals or regulatory rules) of the organizations.

This also implies that a case is distributed over potentially several organizations that work on parts of it concurrently within their given environments consisting of business rules, artifacts, organizational structures and processes (e.g. an invoice is part of both a supplier case and a consumer case). Research in the area of distributed systems has shown that this can quickly lead to diverging views on activities, artifacts, rules and data. Clearly, this is undesirable, because it introduces coordination problems, and case management would not deliver its benefits. Thus, case management systems have to provide a converging view, i.e. a common picture, of the inter-organizational cases. However, classical synchronization and transaction mechanisms in distributed systems do not scale well to this inter-organizational level and do not deliver what users expect. Novel mechanisms need to be designed and tested (cf. also [2]).
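One classical building block for detecting such divergence is a version vector per shared case artifact: each organization increments its own counter when it changes the artifact, and concurrent updates show up as incomparable vectors. A minimal sketch of the detection part only, not a full synchronization protocol:

```typescript
// Sketch of divergence detection with version vectors: each organization
// increments its own counter when it updates a shared case artifact.

type VersionVector = Map<string, number>; // organization -> update count

function bump(v: VersionVector, org: string): void {
  v.set(org, (v.get(org) ?? 0) + 1);
}

// a "dominates" b if a has seen every update that b has seen.
function dominates(a: VersionVector, b: VersionVector): boolean {
  return [...b.entries()].every(([org, n]) => (a.get(org) ?? 0) >= n);
}

// Two replicas of a case artifact have diverged (concurrent updates)
// if neither version vector dominates the other.
function diverged(a: VersionVector, b: VersionVector): boolean {
  return !dominates(a, b) && !dominates(b, a);
}

// Example: supplier and consumer both edit the invoice concurrently.
const supplier: VersionVector = new Map([["supplier", 1]]);
const consumer: VersionVector = new Map([["consumer", 1]]);
diverged(supplier, consumer); // true -> coordination needed
```

Detecting the divergence is, of course, only the first step; how to reconcile the views across organizational boundaries with different rules and regulations remains the open research question.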

Conclusion

I have presented in this blog entry several innovation and research challenges from a business as well as a technical perspective. These challenges have not yet been solved adequately, but I see continuous improvement on these issues, so it can be expected that they will be addressed by consultancies and research organizations.

Stay tuned for my next blog entry where I analyze limitations of existing open source solutions with respect to case management.

References

[1] Becker, Jörg; Kugeler, Martin; Rosemann, Michael (Eds.): Process Management: A Guide for the Design of Business Processes, Springer, 2011, ISBN 978-3642151897

[2] Franke, Jörn: Coordination of Distributed Activities in Dynamic Situations. The Case of Inter-organizational Crisis Management, PhD Thesis (Computer Science), English, LORIA-INRIA-CNRS, Université de Nancy/Université Henri Poincaré, France, 2011.