Batch-processing & Interactive Analytics for Big Data – the Role of in-Memory

In this blog post I will discuss various aspects of in-memory technologies and describe how different Big Data technologies fit into this context.

In particular, I will focus on the difference between in-memory batch analytics and interactive in-memory analytics, and I will illustrate when in-memory technology is really beneficial. In-memory technology leverages fast main memory and processor caches to deliver superior performance.

While deploying in-memory technology seems attractive at first, companies have to carefully plan how they use the scarce resource memory for big data sets, because the amount of data tends to grow once a company successfully masters Big Data technologies. For instance, they need to think about memory fragmentation, scheduling, capacity management, how the data should be structured in-memory, and which data should be kept in-memory at all.

I will explain that some paradigms introduced for non-in-memory analytics, such as the paradigm that it is better not to read data at all than to read all of it, are still very valid for in-memory technologies.

Finally, I will give an outlook on current Big Data technologies and their strengths and weaknesses with respect to in-memory batch analytics and interactive in-memory analytics.

The Concept of In-Memory

The concept of in-memory became more and more popular around 2007/2008, although the fundamental concepts behind it have existed for decades. At that time it was marketed quite heavily by SAP with its HANA in-memory database.

Around the same time, a different paradigm appeared, the concept of distributed Big Data platforms.

In the beginning, both were rather disconnected: in-memory technologies relied on one “big” machine, while distributed data platforms consisted of a huge set of commodity-like machines. In-memory was at this time often associated with interactive queries returning fast responses on comparably small datasets fitting into the memory of one machine, and Big Data platforms with long-running analytics queries crunching large volumes of data scattered over several nodes.

This has changed recently: in-memory techniques have been brought to long-running analytics queries, and distributed Big Data platforms to interactive analytics. The assumed benefit in both cases is that more data can be handled in more complex ways in a shorter time.

However, you need to look carefully at what kind of business benefit you can gain from doing faster or more analytics.

Public sector organizations across various domains can benefit significantly, because their “ROI” is usually measured in non-monetary terms as benefits for society. A faster, fairer, more transparent or scientifically correct analysis is one example of such a benefit. Additionally, supervision of the private sector needs to be on the same technological level as the private sector itself.

Traditional private sector organizations, on the other hand, will have to invent new business models and convince the customer. Here, new machine learning algorithms on large data volumes are usually more beneficial than traditional data warehouse reports. Internet industries, including the Internet of Things and autonomous robots, have obvious benefits, be it the processing of large data volumes and/or the need to react quickly to events in the real world.

The Difference between in-memory batch processing and interactive analytics

Often people wonder why there is still a difference between batch processing and interactive analytics when using in-memory. In order to answer this question let us quickly recap the difference between the two:

  • Distributed big data processes: They are long-running because they need to query data residing on several nodes and/or perform very complex calculations requiring a lot of computing power. Usually they make the calculated/processed data available in a suitable format for interactive analytics, and usually these processes are planned and scheduled in advance.
  • Interactive analytics: These are often ad-hoc queries of low to very high complexity. Usually they are expected to return results within seconds or minutes. However, they can also take much longer and are then candidates for distributed big data processes, so that results are precomputed and stored for interactive analytics. Interactive analytics goes beyond standard tables to return results faster.

Their results can be used either by humans or by other applications, e.g. applications that require predictions to provide an automated service to human beings. Basically, both approaches fit the Lambda architecture paradigm.

In-memory technologies can basically speed up both approaches. However, you need to carefully evaluate your business case for this. For example, it makes sense to speed up your Big Data batch processes so that they finish before your people start working, or to gain time to perform additional processes on the same resources – this is particularly interesting if you have a plethora of different large datasets on which different analytics can make sense. With respect to interactive analytics, you benefit most if you have specific analytics algorithms that benefit from memory locality, e.g. iterative machine learning algorithms.

If you have people working on large tables using aggregations, then you should make them aware that it makes more sense to work with samples, in-memory indexes and suitable data structures as well as high parallelism. Aggregating a large table entirely in-memory is very costly, and the speed difference compared to tables on disk is most likely small. The core paradigm here should be: do not read what is not needed.

In short: be aware of your priorities when leveraging speed-ups from in-memory technology. Not everything has to be in-memory.

Nevertheless, you should first leverage all possible optimizations that do not require in-memory technology. An inefficient data structure on disk does not become a good structure just because it is held in-memory. Additionally, you should think about how much data you really need and how precise your results have to be. As I wrote in a previous blog post, this can save you a lot of time that you can use to perform further analytic tasks.
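The precision/volume trade-off can be made concrete with a small, self-contained Python sketch (the synthetic data and the 1% sample rate are made up for illustration): estimating an average from a random sample touches only a fraction of the data and usually lands within a small error margin of the exact value.

```python
import random

# Hypothetical example: a large list of transaction amounts.
random.seed(42)
amounts = [random.gauss(100.0, 25.0) for _ in range(1_000_000)]

# Full scan: exact average, but touches every element.
exact_avg = sum(amounts) / len(amounts)

# 1% random sample: approximate average, touches only a fraction of the data.
sample = random.sample(amounts, k=len(amounts) // 100)
approx_avg = sum(sample) / len(sample)

print(f"exact:  {exact_avg:.2f}")
print(f"sample: {approx_avg:.2f}  (read only {len(sample)} of {len(amounts)} values)")
```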

In the following, I will describe some challenges with in-memory that you need to tackle to be successful with in-memory technologies.

Challenges with in-memory

Memory fragmentation

Problem

Memory fragmentation does not only occur with in-memory technologies, but on any storage. You can have internal fragmentation, where more memory is allocated to an application than it actually needs, or external fragmentation, where memory is deallocated but new data does not fit into the freed gaps, so additional memory has to be used.

However, it can be rather problematic with in-memory technologies, because main memory is usually the smallest storage available. In the context of Big Data, where there can be a huge diversity of data sets that grow over time, as well as different algorithms that use memory in different ways, this becomes apparent much more quickly than if there were just one static data set that never changes and is always processed the same way.

The issue with memory fragmentation is that you effectively have less memory than is physically available – potentially a lot less. This leads to unexpected performance degradation and to spilling over to slower disk storage to continue the computation, which may lead to thrashing.
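To make external fragmentation tangible, here is a toy Python model (the pool size and the block layout are invented for illustration): the total free memory would be sufficient for a new data set, but no single contiguous gap is large enough, so the allocation fails and the data would have to spill to disk.

```python
# Toy model of external fragmentation: a fixed memory pool of 100 units.
pool_size = 100
# (offset, size) of currently allocated blocks after some allocations/frees.
allocated = [(0, 30), (40, 20), (70, 25)]   # free gaps: 30-40, 60-70, 95-100

def free_gaps(allocated, pool_size):
    """Return the free gaps as (offset, size) tuples."""
    gaps, cursor = [], 0
    for offset, size in sorted(allocated):
        if offset > cursor:
            gaps.append((cursor, offset - cursor))
        cursor = offset + size
    if cursor < pool_size:
        gaps.append((cursor, pool_size - cursor))
    return gaps

gaps = free_gaps(allocated, pool_size)
total_free = sum(size for _, size in gaps)
largest_gap = max(size for _, size in gaps)
print(f"total free: {total_free}, largest contiguous gap: {largest_gap}")
# A request for 20 contiguous units fails although 25 units are free in total,
# so the computation spills over to disk or needs additional memory.
```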

You cannot avoid memory fragmentation entirely, because one cannot know in advance which data set will be loaded at what time and what computation will be needed.

Solution

A first step to handle memory fragmentation is to keep a list of the processes and interactive queries that are regularly executed and to review them for potential fragmentation issues. This list can also be used during monitoring to stay aware of memory fragmentation. One indicator is that the available memory does not match the memory that should actually be consumed. Another indicator is a lot of spills to disk.

There are several strategies to handle identified memory fragmentation. In the case of in-memory batch processes, one should release all the memory after the batch process has been executed. Furthermore, one should use distributed Big Data technologies, which usually work with fixed block sizes from the distributed file system layer (e.g. HDFS). In this case you can partially avoid external fragmentation. You can avoid it only partially, because many algorithms have temporary data or temporarily relevant data which needs to be taken into account as well.

For interactive analytics, a very common recommendation, even by vendors of popular memcache solutions, is to restart the cache from time to time, thereby forcing the data to be reloaded into the cache in an ordered manner and avoiding fragmentation. Of course, once you add, modify or remove data, you again accumulate some fragmentation, which will grow over time.

Another similar approach is called compaction, which exists in traditional relational databases and big data systems. Compaction reduces the fragmentation that occurs due to updates, deletions and insertions of new data. The key here is that you gain performance benefits for your users if you schedule it at a time when the system is not used. Surprisingly, people often do not look at compaction, although it has a significant impact on performance and memory usage. Instead, they rely only on non-optimal default settings, which are usually tuned not for large-scale analytics but for smaller-scale OLTP usage. For instance, for large-scale analytics it can make sense to schedule compaction after all the data has been loaded and no new data is arriving before the next execution of a batch process.

What data should be in-memory? About temperature of data…

The Concept

It is not realistic to have all your data in-memory. This is not only due to memory fragmentation, but also due to the cost of memory, fault tolerance, backups and many other reasons. Hence, you need an approach to decide which data should be in-memory.

As described before, it is important to know your data priorities. Quite often these priorities change, e.g. new data is introduced, or data simply becomes outdated. Usually it is reasonable to expect that data that is several months or years old will not be touched often, except for research purposes. This is where the temperature of data, i.e. hot, warm and cold data, comes into play.

Hot data has been used quite frequently recently and is likely to be used quite frequently in the near future.

Warm data has been used recently, but not as frequently as hot data, and is not likely to be used frequently in the near future.

Cold data has not been used recently and is not likely to be used in the near future.

Usually hot data resides in CPU caches and mainly in main memory. Warm data resides mainly on local disk drives, with only a small fraction in main memory. Cold data resides mostly on slow external storage, potentially accessed via the network or in the cloud.

Managing Temperature of Data

The concept of temperature of data applies to batch processes and interactive analytics equally. However, you need to think about which data needs to be kept hot, warm and cold. Ideally this happens automatically. For example, many common in-memory systems provide an LRU (least recently used) strategy to automatically demote hot data to warm data and eventually to cold data, and to promote it the other way around. For instance, Memcached and SAP HANA support this as a default strategy.

This seems to be a good default strategy, especially if you cannot or do not want to look in more detail into the temperature of your data. Indeed, it rests on sound assumptions, since it is based on the principle of locality, which is also key to distributed Big Data processes and many machine learning algorithms.

However, there are alternative strategies to LRU that you may want to think about (a minimal LRU sketch follows the list below):

  • Most recently used (MRU): The most recently used memory element is moved to warm and eventually to cold storage. This assumes that older, stable data is more relevant than the newest data.
  • Least frequently used (LFU): The data which has been least frequently used is moved to warm storage and eventually to cold storage. The advantage here is that recently used data which has only been accessed once is evicted quickly, while data which has been accessed quite frequently, but not in the recent past, stays in-memory.
  • Most frequently used (MFU): The data which has been most frequently used in the past is moved to warm storage and eventually to cold storage. The idea here is that the more the data has been used, the less valuable it will be and hence it will be accessed much less in the near future.
  • Any combination of the above
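As a reference point for these strategies, here is a minimal LRU cache sketch in Python (the capacity and the data set names are made up; a real system would demote evicted entries to warm storage instead of printing a message):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: the least recently used entry is evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None                      # cache miss -> load from warm/cold storage
        self.entries.move_to_end(key)        # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            evicted_key, _ = self.entries.popitem(last=False)  # least recently used
            print(f"demoting {evicted_key} to warm storage")

cache = LRUCache(capacity=2)
cache.put("orders_2016", "...")
cache.put("orders_2017", "...")
cache.get("orders_2016")        # touch -> becomes most recently used
cache.put("orders_2018", "...") # evicts "orders_2017"
```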

Obviously, the perfect strategy would predict which data will be used least in the future (“clairvoyant algorithms”) and move data accordingly to hot, warm and cold storage. This is of course not exactly possible, but a sound understanding of how people use your Big Data platform can come pretty close to that ideal.

Of course, you can also implement more sophisticated machine learning algorithms that take the environment into account to predict which data and computation will be required in the future given a new task (cf. here for an approach for scheduling multimedia tasks in the cloud based on machine learning algorithms – the use case is different but the general idea is the same). Unfortunately, most of the popular Big Data and in-memory solutions do not implement such an approach yet.

How should the data be structured?

Many people, including business people, have only the traditional world of tables, consisting of rows and columns, in mind when using data. In fact, a lot of analysis is based on this assumption. However, while tables are simple, they might not be the most efficient way to store data in-memory or even to process it.

In fact, depending on your analysis, different formats make sense, such as the following (a small sketch illustrating the data type point follows the list):

  • Use the correct data type: If you have numbers, use a data type that supports numbers, such as integer or double. Dates can often be represented as integers. This requires less storage, and the CPU can read an integer represented as an integer orders of magnitude faster than an integer represented as a string. Similarly, even for integers you should select an appropriate size: if your numbers fit into a 32-bit integer, then you should prefer storing them as 32-bit instead of 64-bit. This will increase your performance significantly. The key message here is: store the data with the right data type, use the available types and understand their advantages as well as their limitations.
  • Column-based: Data is stored in columns instead of rows. This is usually beneficial if you need to access one or more full columns of a given data set. Furthermore, it makes it possible to avoid reading irrelevant data by using storage indexes (min/max) or bloom filters.
  • Graph-based: Data is stored as so-called adjacency lists or sometimes as adjacency matrices. This performs much better than row-based or column-based storage for graph algorithms, such as strongly connected components, shortest paths etc. These algorithms are useful for social network analytics, financial risks of assets sold by different organizations, dependencies between financial assets etc.
  • Tree-based: Data is stored in a tree structure. Trees can usually be searched comparably fast and are often used for database indexes to find out in which data block a row is stored.
  • Search indexes for terms in unstructured text: This is usually useful for matching data objects which are similar but do not have unique identifiers. Alternatively, such indexes can be used for sentiment analysis. Traditional database technology shows, for example, terrible performance for these use cases – even in-memory.
  • Hash-clustering indexes: These can be used in column stores by generating a hash from the values of several columns of one row. This hash is stored as an additional column. It can be used to quickly search for several criteria at the same time using only one column. This reduces the amount of data to be processed at the expense of additional storage.
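To illustrate the data type point, here is a small Python sketch using only the standard library (the values are synthetic): the same one million integers consume far less memory as 32-bit values than as 64-bit values, and dramatically less than as strings.

```python
import sys
from array import array

values = list(range(1_000_000))

as_strings = [str(v) for v in values]   # every value stored as a string object
as_int64   = array("q", values)         # 64-bit integers, stored contiguously
as_int32   = array("i", values)         # 32-bit integers on most platforms

str_bytes = sum(sys.getsizeof(s) for s in as_strings)
print(f"strings : ~{str_bytes / 1e6:.0f} MB")
print(f"int64   : ~{as_int64.itemsize * len(as_int64) / 1e6:.0f} MB")
print(f"int32   : ~{as_int32.itemsize * len(as_int32) / 1e6:.0f} MB")
```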

Furthermore, the same data might be stored in different formats on warm or cold storage, meaning that you have to decide whether you want to keep redundant copies or generate the optimal representation for a given computation from scratch each time.

Compression can also make sense for data in-memory, because it enables you to keep more data in-memory instead of on slower disk drives.

Unfortunately, contrary to the strategies for managing data temperature, there are currently no mature strategies that automatically decide how to store the data. This is a manual decision and thus requires good knowledge of how your Big Data platform is used.

Do we always need in-memory?

With the need to process large data sets, one thing became apparent: even with new technologies, such as in-memory or Big Data platforms, it is sometimes very inefficient to process data by looking at all of it – it is better not to read data at all!

Of course, this means you should not read irrelevant data. For instance, it was very common in traditional databases to read all rows to find the ones matching a query. Even when using indexes, some irrelevant data is read when scanning the index, although storing the index as a tree structure already increased search performance a lot.

More modern solutions use storage indexes and/or bloom filters to decide which rows they need to read. This means they can skip blocks of data that do not contain any rows matching the query (cf. here for the implementation in Apache Hive).
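The idea is easy to sketch in Python (the block size, the data and the example predicate are invented): only per-block min/max values are kept in memory, and whole blocks whose value range cannot match the predicate are never read.

```python
# Toy storage index: the data is split into blocks and only min/max per block
# is kept in memory. A query can skip whole blocks whose range cannot match.

data = list(range(1_000_000))          # pretend this is a large column on disk
block_size = 100_000
blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
storage_index = [(min(b), max(b)) for b in blocks]   # tiny structure in memory

def query_greater_than(threshold):
    matches, blocks_read = [], 0
    for (lo, hi), block in zip(storage_index, blocks):
        if hi <= threshold:
            continue                    # skip block: no row in it can match
        blocks_read += 1                # only now "read" the block
        matches.extend(v for v in block if v > threshold)
    return matches, blocks_read

result, blocks_read = query_greater_than(950_000)
print(f"matching rows: {len(result)}, blocks read: {blocks_read} of {len(blocks)}")
```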

Similarly, probabilistic data structures, such as HyperLogLog, or approaches based on sampling (cf. here) enable you to avoid reading all the data again or at all. In fact, here you can even skip “relevant” data – as long as you read enough data to provide correct results within a small error margin.

Hence, even with in-memory technologies it is always better to avoid reading unnecessary data. Even if the data is already in-memory, the CPU needs more time the more data it has to process – a simple but often overlooked fact.

The impact of Scheduling: Pre-emption

Once your Big Data platform or in-memory platform grows, you will not only get more data, but also more users working on it in parallel. This means that if they run interactive queries or schedule Big Data processes, they need to share resources, including memory and CPU – especially when taking speculative execution into account. As described before, you ideally have a general big picture of what will happen, especially with main memory. However, at peak times – and for some Big Data deployments most of the time – the resources are not enough, because of cost or other reasons.

This means you need to introduce scheduling according to scheduling policies. We briefly touched on this topic before, because the concept of temperature of data implies some kind of scheduling. However, if you have a lot of users, the bottleneck is often simply the number of processors that process the data. Hence, sometimes the analytics of some users are partially interrupted to free resources for other users. These users may use different data sets, meaning that some data might also be moved from main memory to disk drives. After the interrupted tasks are resumed, they may need to reload data from disk drives into memory.

This can sometimes make the perceived performance unpredictable, and you should be aware of it so that you can react properly to incidents reported by users or do more informed capacity management.

Big Data technologies for in-memory

In-memory batch processing

There are several in-memory batch processing technologies for Big Data platforms, for example Apache Spark or Apache Flink. In the beginning, these platforms, especially Spark, had the drawback of representing everything as Java objects in memory. This meant, for instance, that a 10-character string could consume six times more memory than representing it as an array of bytes.

Luckily this has changed: data is now stored in-memory in a columnar fashion, and data on disk that is not relevant can be skipped (via predicate pushdown and an appropriate disk storage format, such as ORC or Parquet).
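As a hedged illustration of what this looks like in practice, the following PySpark sketch reads a Parquet data set (the path and column names are hypothetical); the filter and column selection can be pushed down to the Parquet reader, so irrelevant row groups and columns are skipped.

```python
# Minimal PySpark sketch (assumes a Spark installation and an existing
# Parquet data set at the hypothetical path /data/transactions).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("pushdown-example").getOrCreate()

transactions = spark.read.parquet("/data/transactions")

# The filter and the column selection can be pushed down to the Parquet
# reader, so row groups whose min/max statistics cannot match are skipped
# and only the required columns are read.
result = (transactions
          .filter(col("amount") > 10000)
          .select("customer_id", "amount")
          .groupBy("customer_id")
          .sum("amount"))

result.explain()   # the physical plan shows the pushed filters for the scan
result.show()
```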

Additionally, both support graph batch processing and the processing of continuous streams in-memory. However, both rely on a common abstraction for a data structure in which they represent other data structures, such as graphs; in the case of Spark these are Resilient Distributed Datasets (RDDs)/DataFrames. This means they do not reach the performance of a highly specialized graph engine, but they are more generic and can be integrated with other data structures. For most current use cases this is sufficient.

Additionally, various processing algorithms, mainly in the area of machine learning, are supported.

Sometimes you will see that they are also advertised as interactive platforms. However, this is not their core strength, because they do not, for example, support the concept of data temperature automatically, i.e. the developer is fully responsible for taking hot, warm and cold data into account or for implementing a strategy as described above. Additionally, they do not provide index support for data in-memory, because this is usually much less relevant for batch processes. Hence, if you want to use these technologies for interactive analysis, you have to develop some common IT components and strategies for addressing the temperature of data and the “do not read irrelevant data” paradigm.

In any case, you have to think about scheduling strategies to optimize the resource usage of your available infrastructure.

Depending on your requirements, in-memory batch processing is not needed in all cases, and your Big Data platform should support both in-memory and non-in-memory batch processes to be efficient. In particular, if your batch process loads and processes the data only once, without re-reading parts of it, then you will not benefit much from in-memory.

Interactive in-memory analytics

There are several technologies enabling interactive in-memory analytics. One of the older – but still highly relevant – ones is memcached for databases. Its original use case was to speed up web applications in which many users access, i.e. write and read, the same data in parallel. Similar technologies are also used in Master Data Management (MDM) systems, because they need to deliver data to and receive data from a lot of sources, different systems and business processes with many users. This would be difficult relying on databases alone.
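The pattern behind this is cache-aside: read from the in-memory cache first and fall back to the database only on a miss. Here is a minimal Python sketch; a plain dict stands in for memcached, and with a real client (e.g. pymemcache) only the get/set calls would change.

```python
import time

# Stand-in for memcached: a simple dict with get/set semantics.
cache = {}

def load_customer_from_db(customer_id):
    time.sleep(0.05)                      # simulate a slow database query
    return {"id": customer_id, "name": f"Customer {customer_id}"}

def get_customer(customer_id):
    key = f"customer:{customer_id}"
    value = cache.get(key)                # 1. try the in-memory cache first
    if value is None:
        value = load_customer_from_db(customer_id)   # 2. cache miss -> database
        cache[key] = value                # 3. populate the cache for later readers
    return value

get_customer(42)   # slow: hits the database
get_customer(42)   # fast: served from memory
```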

Other technologies focus on the emerging Big Data platforms based on Hadoop, but also augment in-memory batch processing engines, such as Spark. For instance, Apache Ignite provides functionality similar to memcached, but it also supports Big Data platforms and in-memory batch processing engines. For example, you can create shared RDDs for Spark. Furthermore, you can cache Hive tables or partitions in-memory. Alternatively, you can use the Ignite DataGrid to cache selected queries in-memory.

These technologies support advanced in-memory indexes (keep in mind: it is always better not to read data!) and automated data temperature management. Another example is Apache Tachyon.

There are also very specialized interactive in-memory analytics engines, such as TitanDB for graphs. TitanDB is based on the TinkerPop graph stack, including the interactive graph query (or graph traversal) language Gremlin. SAP HANA is a specific in-memory column database for OLTP, OLAP, text analytics and graph applications. It has been extended into a full application-stack cloud platform based on in-memory technology.

Taking scheduling into account is much trickier with interactive analytics, because one does not know exactly what the users will do, and prediction engines for user behavior in interactive analytics are currently nearly non-existent.

However, you can define different categories of interactive analytics (e.g. simple queries, complex queries, machine learning, graphs ….) and determine your infrastructure as well as its configuration based on these categories.

Conclusions

It makes sense to distinguish between in-memory batch processes and interactive in-memory analytics. In-memory batch processes can be planned and scheduled in advance more easily, and resources can be optimized better for them. They are also more focused on processing all the data. Specific technologies for distributed in-memory Big Data exist and are complementary to technologies for interactive in-memory analytics. The main differences are additional indexes and automated support for the concept of data temperature.

Even for in-memory technology, the key Big Data concept of not reading irrelevant data is of high importance. Processing terabytes of data in-memory even though only a subset is relevant is a waste of resources and, particularly, of time. This is especially difficult to handle for interactive in-memory analytics, where users can do whatever they want. Hence, automated and intelligent mechanisms to support this are highly desirable, and they should be preferred over manually developing the right data model and structures.

Another key concept is to have the right data structure in-memory for optimal processing. For instance, graph structures perform much better than relational row- or column-based structures, which need to be joined very often to execute graph algorithms. Furthermore, probabilistic data structures and probabilistic sampling queries are of growing importance. Depending on your needs, you might have the same data represented redundantly in different data structures for different analysis purposes.

Finally, distinguishing between interactive analytics and batch processing is not always straightforward. For instance, you can have a batch process running for 5 minutes whose results are queried 1000 times, so avoiding the 5 minutes of run time each time can be very beneficial. On the other hand, you can have an interactive query which takes 60 minutes but is only needed by one user once. This may also change over time, so it is important that, even after a solution has been developed, you monitor and audit it regularly with business and technical users to check whether the current approach still makes sense or another approach would be better. This requires a regular dialogue even after the go-live of a Big Data application.

The convergence of both concepts requires more predictive algorithms for managing data in-memory and for queries. These are only in their first stages, but I expect much more over the coming years.

DevOps for your business? – About Uniting Development and Operations

DevOps has become in recent years a term for a new paradigm of integrating and managing development as well as operations of software within and across organizations. In this blog entry I will describe what DevOps is and relate it to existing methodologies, such as agile development, and to organizational structures. Basically, DevOps is a broad term that summarizes a set of best practices, supported by a high degree of automation of all development and operational processes around the software delivery process, using advanced public and/or private cloud technologies. I will conclude with a brief summary of the impact of DevOps on Big Data applications.

The Situation

Many companies have separate development and operations departments. Both usually work under high pressure to deliver and operate applications for business processes of strategic importance.

In recent years it has been shown that the development department has to work closer with internal and external customers to develop the right solutions that the customer can accept. Agile methodologies have been advocated to manage problems arising from the uncertainty of the business environment in which future solutions will be deployed and/or the lack of understanding of customer or IT requirements. These agile methodologies broke with the paradigm of a clearly defined long-term process (e.g. the waterfall model) where the customer is only involved in the beginning and – often too late – at the end, so that mistakes were costly to correct.

At the same time, the operations department faced similar challenges as the development department. Given the uncertainties of the business environment, customers asked for a lot of new services and IT infrastructure changes, but there was little understanding on the customer side of the effects these had, which created high pressure on the operations department, which usually has a low budget for dealing with such changes. Hence, a clearly structured and governed process was needed to handle customer requests. Thus popular IT Service Management frameworks, such as the Information Technology Infrastructure Library (ITIL), were born and used globally. Since critical business applications are operated, the operations department tends to be much more risk-averse and tries to avoid unknown technologies.

It can be observed that both departments needed and implemented different approaches to deal with their customers. This has led to the problem that the departments were divided not only from an organizational perspective, but also from a cultural one. For instance, the development department chose and developed technology with little consultation of the operations department, and after some time it just threw a complex piece of software over the “fence” and told the operations department to operate it. However, since they did not collaborate much, there were a lot of (extremely) costly problems during software delivery and operations. For example, the different environments, such as development or test environments, did not match the required infrastructure. The newest updates needed to fulfill requirements could not be installed fast enough, so development was delayed. Operational staff required training, and this was not considered. Many more problems occurred, because there was a strong interdependency between both departments due to the software delivery process.

Recently, DevOps has been pushed to address this lack of collaboration between the two departments with respect to the software delivery process. Additionally, it addresses the challenges of new technologies, such as the cloud and software-defined infrastructures, which require strong development skills in both the development and the operations department.

DevOps a New Paradigm?

DevOps is not a clearly defined term and there is no reference model, such as ITIL, behind it. However, we can identify some common aspects that can be found in many different papers on the topic (cf. Gartner, this blog entry or this article):

  • A clearly defined as well as highly automated continuous software delivery pipeline across different software environments and the development as well as operations departments
    • Describe involved stakeholders with roles and responsibilities.
    • Defined and measured key performance indicators, e.g. “after committing some software source code in the development environment it is fully tested as well as deployed in the production environment and can be used by business processes within one hour”.
    • Clearly defined environments, e.g. development environment, test environment, acceptance environment and production environment. This implies an unambiguous and idempotent description (i.e. a set of scripts) of how they are created, operated and destroyed, and what virtual resources (computation, memory and network) they require. This means you have to fully leverage private and public cloud technologies. Fully virtualized environments can be created by anyone using just a graphical browser interface (see next section).
    • Avoid reconfiguration of software for different environments. Environments should ideally be identical (e.g. same network addresses and same hardware).
    • Continuous integration of software components delivered by different teams.
    • Fully automated deployment procedure in different environments. Manual deployment requiring human intervention is forbidden.
    • Fully automated regression, integration and acceptance testing. Manual test activities should be reduced to nearly zero, because rapidly changing economic environments require rapid deployment of new solutions.
    • Test-driven development: develop the tests for parts of the software before you develop the software itself.
    • Deploy software in production incrementally, in small chunks and often (e.g. each week), as is required when using Service-oriented Architectures (SOA). Then you avoid big mistakes, and changes cannot have a catastrophic impact if they do not work. If you continue the traditional way of deploying large chunks of software with a lot of changes every few months, you will continue to pay the price of production outages or obstacles to your business processes.
    • Have a consistent fully automated monitoring approach for your software environments. Leverage machine learning techniques to predict software problems before they happen.
    • Allow each stakeholder (development, test, operations and customer) to use a current build of the software deployed in a virtual environment by just using a browser interface
  • Integration of people from the operations department in the agile software development process. Operations has development skills and development has operation skills.
  • Integration of people from the development department in strategic IT operations processes
  • Clearly defined governance structures – have a sound program and project management
  • Integration of best practices, such as ITIL or agile methodologies
  • Fehlerkultur (error culture): Do not blame mistakes on each other, but solve them collaboratively. Finding out whose fault it was costs time and money and is usually not important – just solve them together as they occur. Be positive about errors and confident about handling them. Make error management part of your daily life. Each error is an opportunity to learn for all involved parties.

DevOps is NOT about flat organizations. Flat organizations only make sense if your organization has just one business area and your employee structure is relatively homogeneous rather than heterogeneous. This does not imply that a strict hierarchical model is better, but you need to carefully combine control and agility.

Is DevOps right for your Organization?

Every organization should leverage and adapt DevOps practices, because they lead to significant benefits for all involved stakeholders (cf. also here). It makes sense for startups as well as large corporations. However, there are exceptions, such as research & development of disruptive new technologies. When researching and developing completely new technologies, where the impact on the infrastructure is completely unclear or where you just want to explore very risky software that you are not certain you will use in the future, the overhead of a DevOps organization is too high. From my experience, there you have small teams which are – on purpose – separated from the others to develop a new way of thinking and to use novel technologies. They often have to redo everything from scratch, potentially using very diverse software technologies in a short time frame. Nevertheless, they may still employ some DevOps aspects, such as full computing, memory and network virtualization provided by cloud technologies.

What tools can I use?

You will find a lot of tools for enabling DevOps in your organization. Indeed, tools are an important aspect, because you need a high degree of automation of previously manual activities and you will manage your environments using cloud virtualization technologies.

However, do not forget that it is also about organization and culture. Simply having the tools won’t help you much.

I present here only a few of the many tools that can be used.

Cloud

As I mentioned before, you want to create and manage software environments fast and in a highly automated fashion. Each environment should ideally be identical to avoid errors due to reconfiguration of software. This means they should have the same underlying software, configuration, (virtualized) hardware and the same network configuration (including the same IP addresses). Large-scale public cloud providers, such as Amazon EC2 or Google Compute, already have technologies that make this feasible. If you prefer a private cloud, then you can use OpenStack, which is a Linux-based cloud computing distribution offering similar functionality to Amazon EC2. It offers a web interface for creating new environments in a browser.

Additionally, you can use tools, such as Vagrant, to automate creation and management of software environments.

Continuous Integration

Continuous integration supports the integration of different software components from different teams. It automatically builds and tests complete applications every time a new piece of source code is committed to the version management system, such as Git. Popular tools are Jenkins or Travis.

Besides normal unit and integration testing, one should look at acceptance testing tools, such as Cucumber. There, acceptance tests can be described by business users in (nearly) natural language. This makes them repeatable and reliable.

Continuous Deployment

Deployment should be automated and repeatable, independent of the environment. This means no manual configuration and no manual deployment steps. You have to develop scripts that ensure that the target system is in the desired state after a deployment – independent of the state it is currently in. Popular tools, such as Puppet, Chef or Vagrant, help you with this task.
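The core idea these tools share is idempotency: describe the desired state and act only if the current state differs. Here is a minimal Python sketch of that idea (the file paths and the service name are invented; a real setup would use Puppet or Chef resources instead):

```python
# Minimal sketch of the idempotency idea behind tools like Puppet or Chef:
# describe the desired state and only act if the current state differs.
import filecmp
import shutil
import subprocess

DESIRED_CONFIG = "build/app.conf"          # artifact produced by the CI pipeline
TARGET_CONFIG = "/etc/myapp/app.conf"      # hypothetical target location

def ensure_config_deployed():
    try:
        up_to_date = filecmp.cmp(DESIRED_CONFIG, TARGET_CONFIG, shallow=False)
    except FileNotFoundError:
        up_to_date = False                 # target missing -> not in desired state
    if up_to_date:
        print("already in desired state - nothing to do")
        return
    shutil.copyfile(DESIRED_CONFIG, TARGET_CONFIG)
    subprocess.run(["systemctl", "restart", "myapp"], check=True)
    print("configuration deployed and service restarted")

if __name__ == "__main__":
    ensure_config_deployed()
```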

Recently, web interfaces have become popular, so you can do deployment of complex applications just using your browser (cf. Ubuntu Juju).

Monitoring

Basically, your monitoring infrastructure has to collect all the messages from the various applications deployed in an environment and be able to act on them. Usually there are a lot of messages in various text formats. This means you have to find the right tools to collect and analyze a large amount of data.

You should not only present results (e.g. critical conditions), but also be able to handle them automatically (e.g. repair broken applications). Amazon OpsWorks is one product that can do this for applications deployed on the Amazon EC2 cloud.

An interesting application is the Netflix Chaos Monkey. It is an excellent example of a tool supporting the Fehlerkultur (error culture) mentioned before. Basically, it switches off machines on which your software is deployed at random, but only within a certain time frame (e.g. from 9 to 17 o’clock). This means errors will be detected more easily and can be handled while all employees of the company are there. Hence, you will have fewer or no errors on a Sunday morning, when it is difficult to get the right people to work on a problem. It should be noted that Netflix, a media streaming service, requires strict quality of service and cannot afford that, for example, streaming of media to the customer is interrupted or disturbed. Nevertheless, Chaos Monkey switches off machines in their production environment.
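The behaviour is simple enough to sketch in a few lines of Python (the instance names, the termination probability and the terminate function are placeholders; a real implementation would call your cloud provider’s API):

```python
import datetime
import random

# Hypothetical inventory of instances; terminate_instance is a placeholder for
# whatever your cloud API offers (e.g. an EC2 termination call).
INSTANCES = ["app-server-1", "app-server-2", "app-server-3"]

def terminate_instance(instance_id):
    print(f"terminating {instance_id} ...")   # replace with a real API call

def chaos_round(probability=0.2):
    now = datetime.datetime.now()
    # Only create chaos during office hours (9-17) on weekdays,
    # so that people are around to observe and fix the failure.
    if now.weekday() >= 5 or not (9 <= now.hour < 17):
        return
    if random.random() < probability:
        terminate_instance(random.choice(INSTANCES))

if __name__ == "__main__":
    chaos_round()
```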

The Netflix Conformity Monkey and Janitor Monkey check whether the state of the systems is still acceptable or has degraded. If it is no longer acceptable, the instances are automatically switched off and rebooted, so that an acceptable state is available all the time. Furthermore, they switch off unused instances to reduce costs.

Recently, predictive software maintenance has become a hot topic, where you predict, given the environmental conditions, when your application will fail or slow down – before the event takes place.

DevOps and Big Data

Big Data is about processing large amounts of data of different natures (e.g. structured or unstructured) in acceptable time. There is a trend to leverage Big Data to enable new sophisticated statistical models of complex real-world dependencies. For instance, you want to predict what customers will buy next or when, for example, company cars are likely to fail – even before the event happens.

Having DevOps is mandatory for Big Data. You have a limited set of resources, very business-critical applications and frequent change, e.g. new prediction models. You may also need to do experiments in production environments, for instance to evaluate new machine learning technologies involving real live business data.

Furthermore, another challenge that many people are not yet aware of is that business users will actually develop code and deploy it in production. For example, marketing specialists will develop and test new models using R. Similarly, hardware engineers will develop and deploy prediction algorithms for the reliability of hardware assets. This code will become part of existing critical business applications.

For example, imagine you own a bus company. Your mechanics have a good understanding of the probability distributions for hardware errors, fuel consumption etc. They will develop a model in a statistical programming language, such as R, to predict failure and maintenance needs of your buses based on sensor data as well as maintenance reports. The output of this is used by the scheduling application developed by your development department to avoid delays in bus service delivery. Similarly, a statistical marketing program developed by your marketing department will predict how many customers to expect at what time, based on historical data but also including current events (e.g. a soccer match, news or tweets). This can also be fed into the scheduling application. It is not very realistic that the marketing department will ask the development department to implement a statistical model it has proposed – the overhead is simply too large for experimental Big Data applications.

In fact, this happened a long time ago in the finance industry, where business analysts created macros in Excel/Access or other office software to support the calculation of their complex financial statistical models for creating more and more complex financial products. The problem was that you ended up with a variety of software somewhere in the business with critical impact, no versioning or backup, and unknown dependencies on other software or data. This is obviously bad and can even lead to bankruptcy.

How this can be handled has not yet been the subject of extensive research or experiments.

Big Data: Bring Computation to Data

Big Data is the topic of the coming years. Even today, large Internet companies store exabytes of data, and their revenue model is based on selling products as well as services around this data. Consequently, they need to process data using advanced statistical methods, such as machine learning, and they need to think about how to do this efficiently. Currently, in-memory in particular is hyped to address this issue. However, this is only one aspect. A fundamentally more important aspect is where the data is processed in a distributed multi-node data environment.

A brief history on software architectures

In the beginning of software development, many applications were single monolithic applications deployed on a single computer. This led to several problems: developers could hardly reuse code of monolithic applications, and the approach did not scale well since it was limited to a single computer. The first problem has been addressed by introducing different layers into the architecture. The resulting architectures are usually based on three layers (see next figure): data layer, service layer and presentation layer. The data layer handles any functionality for managing data, such as querying or storing it. The service layer implements business logic, e.g. it implements business processes. The presentation layer allows the user to interact with the implemented business processes, e.g. entering new customer data. The layers communicate with each other using well-defined interfaces, implemented today in REST, OData, SOAP, WebSockets or HTTP/2.0.

[Figure: Three-layer architecture]

With the emergence of the Internet, these layers had to be put physically on different machines to provide greater scalability. However, they were never designed with this in mind. The network has only limited transport bandwidth and capacity. Indeed, for very large data it can be faster to store it on a large drive and transport it by truck to its destination than to transfer it over the network.

Additionally, during development, scalability of data computation is of less interest, because in the Internet world it is often not known how many people will access an application, and this may change over time. Hence, you need to be able to scale dynamically up and down. I observe that more and more of the development effort in this area has moved to operations, who need to implement monitors, load balancers and other technologies to scale applications. This is also the reason why DevOps is a popular and emerging paradigm for developing and operating Internet-scale web applications, such as Netflix.

Towards New Software Architectures: Bring Computation to Data

The multiple-layer approach does make sense, and you could even split it into more layers (“services”), but you have to carefully evaluate the complexity and reusability of your service design. More importantly, you will have to think about new interfaces, because if components are located on different machines or different memory instances, your application will spend a lot of time moving data between them. For instance, the application logic on the application server may request all customer transactions from the database and then correlate them to write the results back into the database. This requires a lot of data to be transferred from the database to the application server and potentially costs a lot of performance. Finally, it does not scale at all.

This problem first emerged when companies introduced the first Online Analytical Processing (OLAP) engines as part of business intelligence solutions for understanding their business. Plain database queries proved too simple and would require transferring a lot of data to the application server first. Hence, the Structured Query Language (SQL) for databases was extended to cope with these new requirements (e.g. the CUBE operator). Moreover, you can define your own custom functions (e.g. SQL stored procedures), but they have to be implemented in a very vendor-specific way. For instance, distributed databases based on Apache Hadoop support custom functions, and you can sometimes integrate other programming languages, such as Java. While stored procedures are already an improvement in terms of security (protection against SQL injection attacks), they have the problem that it is very difficult to write sophisticated programs with them for modern Big Data applications. For instance, many applications require machine learning, statistical correlation or other statistical methods. It is difficult to write these as stored procedures and to maintain support for different vendors. Furthermore, it leads again to monolithic applications. Finally, they are not dynamic – the application cannot decide to do a new computation on the fly without it being reimplemented in the database layer (e.g. implementing a new machine learning algorithm). Hence, I suggest another way to address this issue.
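The underlying data-movement problem is easy to demonstrate with a contrived, self-contained sqlite3 example in Python (the table name and sizes are made up): pulling every row to the application layer moves 100,000 rows, while pushing the aggregation into the database moves only one row per customer.

```python
# Contrived sqlite3 example of the data-movement problem described above.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO transactions VALUES (?, ?)",
                 [(i % 1000, float(i % 97)) for i in range(100_000)])

# Anti-pattern: pull every row to the application server, aggregate there.
rows = conn.execute("SELECT customer_id, amount FROM transactions").fetchall()
totals = {}
for customer_id, amount in rows:           # 100,000 rows crossed the interface
    totals[customer_id] = totals.get(customer_id, 0.0) + amount

# Bring computation to data: let the database aggregate and return 1,000 rows.
totals_pushed = dict(conn.execute(
    "SELECT customer_id, SUM(amount) FROM transactions GROUP BY customer_id"))

print(f"rows moved without pushdown: {len(rows)}")
print(f"rows moved with pushdown:    {len(totals_pushed)}")
```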

A Standard for Bringing Computation to Data?

As mentioned, we want to support modern Big Data applications by providing suitable language support for machine learning and statistical methods on top of any database system (e.g. MySQL, Hadoop, HBase or IBM DB2). The next figure illustrates the new approach. The communication between the presentation and service layer works as usual. However, the services do not call functions on the data layer; instead, they send any data-intensive computation they want to perform as an R script to the data layer, which executes it and only sends back the result.

[Figure: Bring-computation-to-data architecture]

I observed that the programming language R for statistical computing has recently been integrated into various data environments, such as transactional databases, Apache Hadoop clusters or in-memory databases, such as SAP HANA. Hence, I think R could be a suitable language for describing computation that operates on data. Additionally, R already has a lot of built-in packages for machine learning and statistical data processing. Finally, depending on the openness of the underlying data environment, you can integrate R tightly into it, so you may not have to do extensive in-memory transfers.
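To make the interaction concrete, here is a hedged Python sketch of a service shipping an R script to the data layer; the endpoint URL, the JSON contract and the load_table helper are pure assumptions for illustration, since every data environment exposes its own R integration.

```python
# Hedged sketch of the proposed interaction: the service layer does not pull
# the raw data, it ships a small R script to a (hypothetical) data-layer
# endpoint, which executes it close to the data and returns only the result.
import requests

R_SCRIPT = """
transactions <- load_table("transactions")   # helper assumed to be provided by the data environment
fit <- lm(amount ~ customer_age, data = transactions)
coef(fit)
"""

def run_in_data_layer(r_script):
    # The endpoint, its URL and the JSON contract are assumptions for
    # illustration; any real R integration (e.g. in Hadoop or an in-memory
    # database) would expose its own interface.
    response = requests.post("https://data-layer.example.com/execute-r",
                             json={"script": r_script}, timeout=300)
    response.raise_for_status()
    return response.json()["result"]          # only the small result travels back

if __name__ == "__main__":
    print(run_in_data_layer(R_SCRIPT))
```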

The advantages of the approach are:

  • business logic stays in the service layer and does not move to the data layer
  • You can easily add new services without modifying the data layer – so you avoid a tight coupling, which makes it easier to change the data layer or to introduce new functionality
  • You can mine the R scripts generated by services to determine which computation a user is likely to do next, so you can start executing it before the user requests it.
  • Caching and distribution of data processing can be based on a more sophisticated analysis of the R scripts using the R Profiler Rprof
  • R is already known by many business analysts or social scientists/psychologists

However, you will need to have some functionality for governing the execution of the R scripts in the data layer. This includes decisions on when to schedule computation or creating new computing/data nodes (e.g. real-time vs batch). This will require a company-wide enterprise architecture approach where you need to define which data should be real-time and which data should be batch-processed. Furthermore, you need to take into account security and separation of concerns.

In this context, Apache Hadoop might be an interesting solution from the technology perspective.

What is next

The aforementioned approach is only the beginning. By using this solution, you can think about true inter-cloud deployments of your application. Finally, you can enable inter-organizational data-processing business processes.

Enterprise Architecture Management in Business Networks

In my last blog post, I wrote about multi-cloud scenarios for enterprise applications, focusing on enterprise applications of one company distributed over several different cloud providers. This blog post is about enterprise applications connecting data, processes and the organization of different companies within business networks. Particularly complex scenarios with high competition and tight margins, such as third-party logistics (3PL), require a sophisticated approach to ensuring and extending competitive advantages. We will see the challenges of applying reference models, such as EDIFACT, ASC X12 or SCOR. Nevertheless, I see reference models – or more particularly their combination – as key success factors for business networks, since they represent best practices and common understanding, and can significantly improve the on-boarding as well as the continuous education of new business network members. Hence, I will discuss how enterprise architecture and portfolio management can support the application and combination of different reference models in business networks. Finally, I present how the emerging concept of virtual software containers can support this approach from a technology perspective.

Types of Business Networks

One interesting question is what constitutes a business network [1]. Of course, it can be predefined and agreed upon, but there are a lot of business networks where there are undefined and informal relations between two companies that also have (in-)dependent relations with other companies. The whole network of relations is called a business network. This is very similar to social networks, where two related human beings have independent relations to other human beings. However, all types of business networks have different forms of implicit or explicit governance, i.e. decision-making structures. Implicit governance refers to the fact that the chosen governance model has not been defined or agreed on by all involved parties in a business network. Explicit governance refers to an awareness and definition of governance arrangements by all parties in a business network.

The following generic modes of governance can exist in a business network (see next figure):

  • One party holds most or all types of decision-making roles and the others have merely an execution role

  • A group holds most or all types of decision-making roles and the majority has only an execution role

  • Several large groups hold decision-making roles related to different aspects and the majority has only an execution role

  • Everybody has every role

[Figure: Modes of governance in business networks]

Additionally, business networks may exhibit different degrees of awareness and intensity of relations. On the one hand, you may have a very structured business network, such as a supply chain, and on the other hand there is the free market, where two parties interact directly without considering other parties in their interaction. Both extremes are unlikely, and we will find companies across the whole spectrum. For instance, within a larger supply chain, one company may know only the direct predecessor and the direct successor company. It may just agree on the specification of the product to be delivered, but may not exchange any data or impose any processes on how the product should be manufactured. This means there is a limited degree of awareness and the intensity is less strong, because they do not really know how something is achieved by the other organizations in the business network.

[Figure: Contract logistics example with a third-party logistics provider]

It can be observed that business networks are becoming more complex, because new types of business networks emerge, such as contract logistics or third-party logistics, where your business partners integrate directly and dynamically into your manufacturing plant or point of sale as well as into the corresponding business processes. Hence, you need to work out best practices and stay ahead of the competition. An example can be seen in the previous figure, where the third-party logistics provider has a packaging business process deployed at “Manufacturing Plant A”. This business process leverages applications and other resources within the sphere of “Manufacturing Plant A”. Besides delivery, the third-party logistics provider integrates similarly into “Manufacturing Plant B”, where it does pre-assembly of the parts delivered from “Manufacturing Plant A”.

Applying Reference Models for Business Networks

Reference models have existed for several decades in the areas of business information systems, management and software engineering. Some are driven by academia and others by industry. Usually both have been validated scientifically and in practice.

Reference models represent best industry practices for business processes derived from experts and organizations. They can cover the process, organization/governance, product, data and/or IT application perspective within a given business domain. Hence, they can also be viewed as standards. Examples of reference models are EDIFACT, SCOR, Prince2 or TOGAF. These are rather generic models, but there are also industry-specific ones, such as those for humanitarian supply chain operations [2] or retailing [3].

The main benefits of reference models with respect to business networks are:

  • Support your Enterprise Architecture Management (e.g. by reduced modeling efforts, transparency or common language)

  • Benchmark against industry

  • Evaluation of applications for enabling business networks

  • Business network integration by integrating available applications in a business network

There are some issues involved when using reference models:

  • They are “just” models. Having them is like having a book on a shelf – pretty useless

  • Some of them are very generic applying to any business case/network and others are very specific

  • Some focus on business processes (e.g. SCOR), some on business data (e.g. EDIFACT, ASC X12), nearly none on organizational/governance aspects, others on material or money flows and others combine only some of the aspects (e.g. ARIS)

  • Some do provide key performance indicators for benchmarking your performance against the reference model, but most do not

  • It is unclear how different reference models can be combined and tailored to enable business networks

  • Tools supporting definition, viewing, visualizing, expertise provisioning, publishing or adaptation of these models are not standardized and a wide variety exists

  • Tools for monitoring the implementation of reference models in information systems consisting of technology and humans do not really exist

Reference models for business networks, such as EDIFACT, already exist and are used successfully in practice. However, in order to benefit from reference models in a business network, you will need an integrated approach addressing the aforementioned issues, which I will present in the next section.

Enterprise Architecture Management in Business Networks

Reference models are needed for superior business performance to deal with the increasing complexity of business networks. You will never have a perfect world by using only one reference model. Hence, you will need an enterprise architecture management approach for business networks that efficiently and effectively addresses the issues of a single reference model by combining several reference models (see next figure). Traditionally, enterprise architecture management focused only on the single enterprise and not on business networks, but given the growing complexity of business networks and disruptive societal changes, it is mandatory to consider the business network dimension.

[Figure: Reference model puzzle]

Establishing an enterprise architecture management approach depends on the type of business network, as I have explained before. For example, you may have one organization selecting and managing the reference model portfolio and application landscape for the whole business network. Alternatively, no single organization may be responsible, but then all members need to align and be aware of each other’s portfolios; for instance, you can create a steering board for this. Additionally, you will need to establish key performance indicators and benchmarking processes with respect to the business network’s reference model portfolio.

Once you have an enterprise architecture management approach leveraging combined and tailored reference models, you will have to address the aforementioned dynamics as well as the tight integration between business partners in the business network’s information systems. Traditional ERP, CRM and SCM software packages will face difficulties, because even if all partners used the same systems, there would be a huge variety of configurations reflecting the different internal business processes of the members of a business network. Additionally, you will have to manage access and provisioning over the Internet.

Cloud-based solutions already address these challenges partially. They help you to manage access and governance and to provide clearly defined interfaces via the emerging concept of API management. However, these approaches do not reach far enough. You cannot dynamically move business processes and the corresponding applications and data between organizations as a package and integrate them at your business partners’ premises. Furthermore, business processes may change quickly, and you want to reuse and leverage these changes in many different organizations and their applications. This would facilitate a lot of scenarios, such as “bring your own digital business process” in third party logistics. Hence, there is still a need for further technology innovation and research.

Conclusion: Software Containers for “Bring your own Digital Business Process”

We have seen that new complex scenarios in business networks, such as third party logistics, as well as high competition, tight network integration and dynamics impose new challenges. Instant business network adaptation as well as tight integration between business partners will be a key differentiator between business networks and will ultimately decide about their success. Reference models representing industry best practices need to be combined and tailored on the business network level to achieve the network’s goals. However, no silver bullet exists, so you will also need to enable enterprise architecture management at the business network level. Finally, you need tools to enable dynamic movement of business processes as well as applications between different organizations in a business network in a coherent and reusable way.

Unfortunately, these tools do not exist at the moment, but there are some first approaches which you should investigate in this context. Docker can create containers consisting of digital business process artifacts, applications, databases and more. These containers can be sent over the business network and easily be integrated with containers existing in other organizations. Hence, the vision of instant dynamic business network adaptation might not be as far-fetched as we think. The next figure illustrates this idea: The third party logistics provider sends the containers “Packaging” and “Pre-Assembly” to its business partners. These containers consist of applications supporting the corresponding business processes. They are executed in the business partners’ clouds and integrate with the existing business processes and applications there (e.g. the ERP system). Employees of the third party logistics provider use them on the business partner’s side. The containers are executed at the business partner’s side because the business process physically takes place there, so it makes sense to let it happen digitally there as well, instead of sending a lot of data back and forth over the network or suffering from a lack of application integration.

[Figure: Business network software components]

References

[1] Harland, C.M.: Supply Chain Management: Relationships, Chains and Networks, British Journal of Management, Volume 7, Issue Supplement s1, p. S63-S80, 1996.

[2] Franke, Jörn; Widera, Adam; Charoy, François; Hellingrath, Bernd; Ulmer, Cédric: Reference Process Models and Systems for Inter-Organizational Ad-Hoc Coordination – Supply Chain Management in Humanitarian Operations, 8th International Conference on Information Systems for Crisis Response and Management (ISCRAM’2011), Lisbon, Portugal, 8-11 May, 2011.

[3] Becker, Jörg; Schütte, Reinhard: A Reference Model for Retail Enterprise, Reference Modeling for Business Systems Analyses, (eds.) Fettke, Peter; Loos, Peter, pp. 182-205, 2007.

[4] Verwijmeren, Martin: Software component architecture in supply chain management, Computers in Industry, 53, p. 165-178, 2004.

[5] Themistocleous, Marinos; Irani, Zahir; Love, Peter E.D.: Evaluating the integration of supply chain information systems: a case study, European Journal of Operational Research, 159, p. 393-405, 2004.

Scenarios for Inter-Cloud Enterprise Architecture

The unstoppable cloud trend has reached end users and companies. Particularly the former openly embrace the cloud, for instance by using services provided by Google or Facebook. The latter are more cautious, fearing vendor lock-in or exposure of confidential business data, such as customer records. Nevertheless, for many scenarios the risk can be managed and is accepted by companies, because the benefits, such as scalability, new business models and cost savings, outweigh the risks. In this blog entry, I will investigate in more detail the opportunities and challenges of inter-cloud enterprise applications. Finally, we will have a look at technology supporting inter-cloud enterprise applications via cloudbursting, i.e. enabling them to be extended dynamically over several cloud platforms.

What is an inter-cloud enterprise application?

Cloud computing encompasses all means to produce and consume computing resources, such as processing units, networks and storage, existing in your company (on-premise) or on the Internet. Particularly the latter enables dynamic scaling of your enterprise applications, e.g. when you suddenly get a lot of new customers but do not have the capacity to serve them all with your own computing resources.

Cloud computing comes in different flavors and combinations of them:

  • Infrastructure-as-a-Service (IaaS): Provides hardware and basic software infrastructure on which an enterprise application can be deployed and executed. It offers computing, storage and network resources. Example: Amazon EC2 or Google Compute.
  • Platform-as-a-Service (PaaS): Provides on top of an IaaS a predefined development environment, such as Java, ABAP or PHP, with various additional services (e.g. database, analytics or authentication). Example: Google App Engine or Agito BPM PaaS.
  • Software-as-a-Service (SaaS): Provides on top of an IaaS or PaaS a specific application over the Internet, such as a CRM application. Example: SalesForce.com or Netsuite.com.

When designing and implementing/buying your enterprise application, e.g. a customer relationship management (CRM) system, you need to decide where to put it in the cloud. For instance, you can run it fully on-premise or put it on a cloud in the Internet. However, different cloud vendors exist, such as Amazon, Microsoft, Google or Rackspace, and they offer different flavors of cloud computing. Depending on the design of your CRM, you can put it either on an IaaS, PaaS or SaaS cloud or on a mixture of them. Furthermore, you may only put selected modules of the CRM on a cloud in the Internet, e.g. a module for doing anonymized customer analytics. You will also need to think about how this CRM system is integrated with your other enterprise applications.

Inter-Cloud Scenario and Challenges

Basically, the exemplary CRM application runs partially in the private cloud and partially in different public clouds. The CRM database is stored in the private cloud (IaaS), while some (anonymized) data is sent to public clouds on Amazon EC2 (IaaS) and Microsoft Azure (IaaS) for number-crunching analysis. Paypal.com is used for payment processing. Besides customer data and buying history, the database contains sensor information from different points of sale, such as how long a customer was standing in front of an advertisement. Additionally, the sensor data can be used to trigger actuators, such as posting on the shop’s Facebook page what is currently trending, using the cloud service IFTTT. Furthermore, the graphical user interface presenting the analysis is hosted on Google App Engine (PaaS). The CRM is integrated with Facebook and Twitter to enhance the data with social network analysis. This is not an unrealistic scenario: many (grown) startups already deploy a similar setting, and established corporations experiment with it. Clearly, this scenario supports cloud-bursting, because the cloud is used heavily.

I present in the next figure the aforementioned scenario of an inter-cloud enterprise application leveraging various cloud providers.

[Figure: Inter-cloud architecture]

There are several challenges involved when you distribute your business application over your private and several public clouds.

  • API Management: How do you describe different types of business and cloud resources, so that you can make efficient and cost-effective decisions about where to run the analytics at a given point in time? Furthermore, how do you represent different storage capabilities (e.g. in-memory, on-disk) in different clouds? This goes further up to the level of the business application, where you need to harmonize or standardize business concepts, such as “customer” or “product”. For instance, a customer described in “Twitter” terms is different from a customer described in “Facebook” or “Salesforce.com” terms. You should also keep in mind that semantic definitions change over time, because a cloud provider changes its capabilities, such as new computing resources, or its focus. Additionally, you may want to change your cloud provider dynamically without disrupting the operation of the enterprise application.
  • Privacy, Risk and Security: How do you articulate your privacy, risk and security concerns? How do you enforce them? While there are already technologies and standards for this, the cloud setting imposes new problems. For example, if you update encrypted data regularly, the cloud provider may be able to reconstruct parts or all of your data from the differences. Furthermore, it may maliciously change the data. Finally, the market is fragmented, without an integrated solution.
  • Social Network Challenge: Similarly to the semantic challenge, there is the problem of semantically describing social data and doing efficient analysis over several different social networks. Users may also arbitrarily change their privacy preferences, making reliable analytics difficult. Additionally, your whole company’s organizational structure and the (in-)official networks within your company are already exposed in social business networks, such as LinkedIn or Xing. This further blurs the borders of your enterprise, which has to adapt by integrating social networks into its business applications. For instance, your organizational hierarchy, informal networks or your company’s address book probably already exist partly in social networks.
  • Internet of Things: The Internet of Things consists of sensors and actuators delivering data or executing actions in the real world, supported by your business applications and processes. Different platforms exist to source real-world data or schedule actions in the real world using actuators. The API management challenge exists here as well, but it goes even further: you create dynamic semantic concepts and relate your Internet of Things data to them. For example, you have attached an RFID tag and a temperature sensor to your parcels. Their data needs to be added to the information about your parcel in the ERP system. Besides the semantic concept “parcel” you also have that of a “truck” transporting your “parcel” to a destination, i.e. you have additional location information. Furthermore, the parcel may be stored temporarily in a “warehouse”. Different business applications and processes may need to know where the parcel is. They do not query the sensor data directly (e.g. “give me data from tempsen084nl_98484”), but rather formulate a query such as “list all parcels in warehouses with a temperature above 0 °C” or “list all parcels in transit”. Hence, Internet of Things data needs to be dynamically linked with business concepts used in different clouds (see the sketch after this list). This is particularly challenging for SaaS applications, which may have different conceptualizations of the same thing.
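
To make the gap between sensor-level and business-level queries more concrete, below is a minimal sketch in Java. All class, field and sensor names are hypothetical and only illustrate the idea: the application asks for “parcels in warehouses with a temperature above 0 °C”, while the linkage to individual sensors such as “tempsen084nl_98484” is resolved behind the scenes.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Minimal sketch of linking raw sensor readings to business concepts.
// All class and field names (Parcel, SensorReading, ...) are hypothetical.
public class ParcelQueries {

    record SensorReading(String sensorId, double temperatureCelsius) {}

    record Parcel(String parcelId, String location, String temperatureSensorId) {}

    // Business-level query: "list all parcels in warehouses with a temperature above 0 °C"
    static List<Parcel> parcelsInWarmWarehouses(List<Parcel> parcels,
                                                Map<String, SensorReading> readingsBySensor) {
        return parcels.stream()
                .filter(p -> p.location().startsWith("warehouse"))
                .filter(p -> {
                    SensorReading r = readingsBySensor.get(p.temperatureSensorId());
                    return r != null && r.temperatureCelsius() > 0.0;
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Parcel> parcels = List.of(
                new Parcel("P-1", "warehouse-amsterdam", "tempsen084nl_98484"),
                new Parcel("P-2", "truck-17", "tempsen085nl_11111"));
        Map<String, SensorReading> readings = Map.of(
                "tempsen084nl_98484", new SensorReading("tempsen084nl_98484", 4.2),
                "tempsen085nl_11111", new SensorReading("tempsen085nl_11111", -2.0));
        System.out.println(parcelsInWarmWarehouses(parcels, readings));
    }
}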

Enterprise Architecture for Inter-Cloud Applications

You may wonder how you can integrate the above scenario into your application landscape at all, and why you should do it at all. The basic promise of cloud computing is that it scales according to your needs and that you can outsource infrastructure to people who have the knowledge and capabilities to run it. Particularly small and medium-sized enterprises benefit from this and from the cost advantage. It is not uncommon that modern startups start their IT in the cloud (e.g. FourSquare).

However, also large corporations can benefit from the cloud, e.g. as a “neutral” ground for a complex supply chain with a lot of partners or to ramp up new innovative business models where the outcome is uncertain.

Be aware that in order to offer a solution based on the cloud, you first need a solid maturity of your enterprise architecture. Without it you are doomed to fail, because you cannot do proper risk and security analysis, scale appropriately, or benefit from cost reductions and innovation.

I propose in the following figure an updated model of the enterprise architecture with new components for managing cloud-based applications. The underlying assumption is that you already have an enterprise architecture, more particularly a semantic model of business objects and concepts.

[Figure: Inter-cloud architecture (updated)]

  • Public/Private Border Gateway: This gateway is responsible for managing the transition between your private cloud and different public clouds. It may also deploy agents on each cloud to enable secure direct communication between different cloud platforms without the necessity to go through your own infrastructure. You might have more fine-granular gateways, such as private, closest supplier and public. A similar idea came to me a few years ago when I was working on inter-organizational crisis response information systems. The gateway does not only work on the lower network level, but also on the level of business processes and objects. It is business-driven and, depending on business processes as well as rules, decides dynamically where the borders should be set. This may also mean that different business processes have access to different things in the Internet of Things.
  • Semantic Matcher: The semantic matcher is responsible for translating business concepts from and to the different technical representations of business objects in different cloud platforms. This can mean simple transformations of non-matching data types, but also the enrichment of business objects from different sources. This goes well beyond current technical standards, such as EDI or ebXML, which I see as a starting point. Semantic matching is done automatically, so there is no need to create time-consuming manual mappings. Furthermore, the semantic matcher enhances business objects with Internet of Things information, so that business applications can query or trigger them on the business level as described before (see the interface sketch after this list). The question here is how you can keep people in control of this (see Monitor) and leverage semantic information.
  • API Manager: Cloud API management is the topic of the coming years. Besides the semantic challenge, this component provides all necessary functionality to bill, secure and publish your APIs. It keeps track of who is using your API and what impact changes to it may have. Furthermore, it supports you in composing new business software distributed over several cloud platforms using different APIs that are subject to continuous change. The API Manager will also have a registry of APIs with reputation and quality-of-service measures. We now see a huge variety of different APIs by different service providers (cf. ProgrammableWeb). However, the scientific community and companies have not yet picked up the inherent challenges, such as the aforementioned semantic matching, monitoring of APIs, API change management and alternative API compositions. While there exists some work in the web service community, it has not yet been extended to the full Internet dimension as described in the scenario here. Additionally, it is unclear how they integrate the Internet of Things paradigm.
  • Monitor: Monitoring is of key importance in this inter-cloud setting. Different cloud platforms offer different and possibly very limited means for monitoring. A key challenge here will be to consolidate the monitoring data and provide an adequate visual representation for doing risk analysis and selecting alternative deployment strategies on the aggregated business process level. For instance, by leveraging semantic integration we can schedule requests to semantically similar cloud and business resources. Particularly in the Internet of Things setting, we may observe unpredictable delays, which lead to delayed execution of real-world activities, e.g. a robot is notified only after 15 minutes that a parcel fell off the shelf.
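
To illustrate the role of the semantic matcher described above, the following is a minimal interface sketch in Java. The interface and type names are my own assumptions and only indicate what such a component could look like; they are not a definitive design.

import java.util.Map;

// Hypothetical sketch of a semantic matcher component for inter-cloud applications.
// It translates a harmonized business object (e.g. "customer") into the technical
// representation expected by a specific cloud platform, and back.
public interface SemanticMatcher {

    // A harmonized business object, e.g. a "customer" with canonical attribute names.
    record BusinessObject(String concept, Map<String, Object> attributes) {}

    // Translate a business object into the representation of a target cloud platform.
    Map<String, Object> toPlatform(BusinessObject object, String targetPlatform);

    // Translate a platform-specific representation back into the harmonized form.
    BusinessObject fromPlatform(Map<String, Object> raw, String sourcePlatform);

    // Enrich a business object with Internet of Things data (e.g. current location of a parcel).
    BusinessObject enrichWithIotData(BusinessObject object);
}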

Developing and Managing Inter-Cloud Business Applications

Based on your enterprise architecture, you should ideally employ a model-driven engineering approach. This approach enables you to automate the software development process. Be aware that this is not easy to do and has often failed in practice. However, I have also seen successful approaches. It is important that you select the right modeling languages, and you may need to implement your own translation tools.

Once you have all this infrastructure, you should think about software factories, which are ideal for developing and deploying standardized services for selected platforms. I imagine that in the future we will see small emerging software factories focusing on specific aspects of a cloud platform. For example, you will have a software factory for designing graphical user interfaces using map applications enhanced with selected OData services (e.g. warehouse or plant locations). In fact, I soon expect a market for software factories that extends the idea of very basic crowdsourcing platforms, such as Amazon Mechanical Turk.

Of course, as more and more business applications shift towards private and public clouds, you will introduce new roles in your company, such as the Chief Cloud Officer (CCO). This role is responsible for managing the cloud suppliers, integrating them into your enterprise architecture, and ensuring proper controlling as well as risk management.

Technology

The cloud already exists today, and more and more tools emerge to manage it. However, they do not take the complete picture into account. I described several components for which no technologies exist yet. However, some go in the right direction, as I will briefly outline.

First of all, you need technology to manage your APIs to provide a single point of management towards your cloud applications. For instance, Apache Deltacloud allows managing different IaaS providers, such as Amazon EC2, IBM SmartCloud or OpenStack.

IBM Research also provides a single point of management API for cloud storage. This goes beyond simple storage and enables fault tolerance and security.

Other providers, such as Software AG, Tibco, IBM or Oracle, provide “API Management” software, which covers only a special case of API management. In fact, they provide software to publish, manage the lifecycle of, monitor, secure and bill your own APIs for the public on the web. Unfortunately, they do not describe the necessary business processes to enable their technology in your company. Besides that, they do not support B2B interaction very well, but focus on business-to-developer aspects only. Additionally, you find registries for public web APIs, such as ProgrammableWeb or APIHub, which are a first starting point for finding APIs. Unfortunately, they do not feature semantic descriptions and thus no semantic matching towards your business objects, which means a lot of laborious manual work for doing the matching towards your application.

There is not much software for managing the borders between private and public clouds, or even for allowing more fine-granular borders, such as private, closest partner and public. There is software for visualizing and monitoring these borders, such as the eCloudManager by Fluid Operations, which features semantic integration of different cloud resources. However, it is unclear how you can enforce these borders, how you control them and how you can manage the different borders. Dome 9 goes in this direction, but focuses only on security policies for IaaS applications. It only understands data and low-level security, not security and privacy over business objects. Deployment configuration software, such as Puppet or Chef, is only a first step, since it focuses only on deployment, not on operation.

On the monitoring side you will find a lot of software, such as Apache Flume or Tibco HAWK. While these operate more on the lower level of software development, IFTTT enables the execution of business rules over data on several cloud providers that offer public APIs. Surprisingly, it considers itself at the moment more of an end-user-facing company. Additionally, you find approaches for monitoring distributed business processes in the academic community.

Unfortunately, we find little ready-to-use software in the Internet of Things area. I have myself worked with several R&D prototypes for cloud integration and gateways, but they are not ready for the market. Products have emerged, but they only serve a special niche, e.g. Internet of Things-enabled point-of-sale shops. They particularly lack a vision of how they can be used in an enterprise-wide application landscape or within a B2B enterprise architecture.

Conclusion

I described in this blog the challenges of inter-cloud business applications. I think that in the near future (3-5 years) all organizations will have some of them. Technically, they are already possible and exist to some extent. For many companies, the risk and costs will be lower than managing everything on their own. Nevertheless, a key requirement is that you have a working enterprise architecture management strategy. Without it you won’t see any benefits. More particularly, from the business side you will need adequate governance strategies for different clouds and APIs.

We have already seen key technologies emerging, but there is still a lot to do. Despite decades of research on semantic technologies, there exists today no software that can perform automated semantic matching of cloud and business concepts existing in different components of an inter-cloud business application. Furthermore, there are no criteria for selecting a semantic description language for business purposes as broad as described here. Enterprise architecture management tools in this area are only slowly emerging. Monitoring is still fragmented, with many low-level tools but only few high-level business monitoring tools. They cannot answer simple questions such as “if cloud provider A goes down, how fast can I recover my operations and what are the limitations?”. API management is another evolving area which will have a significant impact in the coming years. However, current tools only consider low-level technical aspects and not high-level business concepts.

Finally, you see that a lot of the challenges mentioned in the beginning, such as the social network challenge or the Internet of Things challenge, are simply not yet solved, but large-scale research efforts are underway. This means further investigation is needed to clarify the relationships between the aforementioned components. Unfortunately, many of the established middleware vendors lack a clear vision of cloud computing and the Internet of Things. Hence, I expect this gap to be filled by startups in this area.

Modularizing your Business and Software Component Design

In this blog, I will talk about modularizing your enterprise from a business and software perspective. We start from the business perspective, where I provide some background on how today’s businesses are modularized. Afterwards, we will investigate how we can support the modularized business with software components and how they can be designed. Finally, we will look at some software tools enabling component-based design for a modularized business, such as the Service Component Architecture (SCA) or OSGi.

Business perspective

You will find a lot of different definitions of what a business module is and how a business can be modularized. Most commonly, business modules are known as business functions, such as controlling, finance, marketing, sales or production. Of course, you can also view this on a more fine-granular level. Furthermore, we may have several instances of the same module. This is illustrated in the following figure. On the left-hand side the business modules of a single enterprise are shown. On the right-hand side you see the business modules of a decentralized organization, where the enterprise is split up into several enterprises, one for each region. Business modules are replicated across regions, but adapted to local needs.

[Figure: Business architecture]

A module has usually clear interfaces to other modules. For instance, in earlier times you used paper forms to order something from the production department.

One of the most interesting questions is how one should design business modules. There is no easy answer, but one goal is to reduce complexity between modules. This means there should not be many dependencies between modules, if any, while there can be a lot of dependencies within one module. For instance, people work very closely together in the production department, because they share common knowledge and resources, such as machines or financial resources.

On the other side, production and sales have some very different business processes. Obviously, they still depend on each other, but this dependency should be handled through a clear interface between them. For example, there can be regular feedback from a sales person to the production engineer on what the customer needs.

Clearly, how you define business modules and the organization depends on the economic environment. However, this environment changes, which means business modules can be retired, and new interfaces or completely new business modules can be created.

Unfortunately, this is usually not very well documented and communicated in many businesses. Particularly, the reasons why a business has been designed as a given set of modules and dependencies usually exist only in the heads of a few people. Additionally, the interfaces between business modules and their purpose are often not obvious. This can mean a significant loss of competitive advantage.

Linking Business and IT Perspective: Enterprise Architecture

Business and IT do not necessarily have the same goals. This means they need to be aligned, so that they are not in conflict. Hence, you need to map your business modules to your IT components. This process is called enterprise architecture management. During this process, the enterprise architecture is constantly modified and adapted to the economic environment.

Of course, you cannot map your whole business and your whole IT, because this would be too costly. Nevertheless, you need to choose the important parts that you want to map. Additionally, you will need to define governance processes and structures related to this, but this is not part of this blog entry.

One popular, but simple, illustration is an enterprise architecture composed of four architectures:

  • The Business Architecture describes the business functions/modules, their relations within business processes, people, the underlying strategy, business goals and the relevant economic environment.
  • The Information Architecture is about the business data, its relationships to business functions/modules, processes and people, its value as well as security concerns.
  • The Software Architecture depicts different kinds of components according to IT goals and their relations to business data and business functions/modules.
  • The Technology Architecture is about the technology foundation for enabling the other architectures. It describes the basic infrastructure in form of hardware and software components. This includes local environments as well as cloud environments, such as OpenStack, Google Compute or Amazon EC2.

Some people additionally advocate an IT security architecture. I propose not to model it as an additional architecture, but to include IT security concerns in each of the aforementioned architectures. This increases the awareness for IT security in your business. Nevertheless, appropriate tools can generate a complete security view over all architectures from the models.

There are many popular tools, such as the ARIS toolset, for mapping your enterprise architecture.

Of course, you cannot define only top-down from business to IT how this architecture should be designed. You also need to take the IT perspective into account.

IT perspective

As mentioned, IT and business goals are not necessarily the same. IT focuses on three areas: storing information (storage), processing information (computation) and transporting information (network). The goal is to do this in an efficient manner: only the minimum of information should be stored, processing information should be as fast as possible, and transporting information should only consume minimal resources. Clearly, there are constraints ranging from physical laws over business goals to IT security concerns. For instance, the three key IT security goals, namely confidentiality, integrity and availability, often have a negative impact on all three IT goals.

As I have explained before, business modules are mapped to software components and vice versa. One big question here is the design of software components, i.e. which software functionality (representing business functionality) should be part of one software component and not of another. Software components are usually different from the business modules, because they have different design criteria.

In the past, people often used heuristics, e.g. they introduced “data components” and “functional components”. This makes sense, because you should not have 50 different databases, but only the right number of databases for your purpose (e.g. one for NoSQL, one for SQL and/or probabilistic databases). This reduces resource needs and avoids inconsistent data. However, there is no general approach for how these heuristics should be applied by different enterprise architects. Another problem is that communication patterns (e.g. via message brokers, such as RabbitMQ) are completely left out.

Hence, I think a more scientific or general approach should be taken towards the design of components, because these heuristics do not give you good guidelines for a sustainable architecture. Based on the three IT focus areas, I propose to have software components for storage (e.g. databases), computation (e.g. business logic) and network (e.g. message brokers). Once you have identified these components, you need to decide which functionality you should put in which component. One key goal should be to reduce the dependencies between components: the more communication you have, the more dependencies there are between the different functions in components. Evaluating this manually can be costly and error-prone. Luckily, some approaches do this for you, and they can be integrated with business modeling as well as software component management tools (cf. an approach that derives the design of software components (managed using the Service Component Architecture (SCA)) from the communication patterns in business processes (modeled using the business process modeling notation (BPMN))).
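
The idea of evaluating dependencies between candidate components can be illustrated with a small, simplified sketch in Java: given how often functions call each other, the coupling between two candidate components can be measured as the weight of the calls crossing the component boundary. This is my own toy illustration, not the referenced approach; all function names are made up.

import java.util.Map;
import java.util.Set;

// Simplified illustration: measure the coupling between two candidate software
// components as the weighted number of calls that cross the component boundary.
public class ComponentCoupling {

    // callCounts.get("f1").get("f2") = how often function f1 calls function f2
    static int crossComponentCalls(Map<String, Map<String, Integer>> callCounts,
                                   Set<String> componentA, Set<String> componentB) {
        int coupling = 0;
        for (var caller : callCounts.entrySet()) {
            for (var callee : caller.getValue().entrySet()) {
                boolean aToB = componentA.contains(caller.getKey()) && componentB.contains(callee.getKey());
                boolean bToA = componentB.contains(caller.getKey()) && componentA.contains(callee.getKey());
                if (aToB || bToA) {
                    coupling += callee.getValue();
                }
            }
        }
        return coupling;
    }

    public static void main(String[] args) {
        var calls = Map.of(
                "createOrder", Map.of("checkStock", 10, "bookRevenue", 2),
                "checkStock", Map.of("createOrder", 1));
        // Candidate design: sales functions in one component, controlling functions in another
        int coupling = crossComponentCalls(calls,
                Set.of("createOrder", "checkStock"), Set.of("bookRevenue"));
        System.out.println("Cross-component calls: " + coupling); // -> 2
    }
}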

Another means for coherent software component design is to have the enterprise architects responsible for mapping one part of the business (e.g. controlling) review the software architecture of another part of the business (e.g. marketing). I introduced such a process in an enterprise some time ago. Such an approach ensures that architecture decisions are made consistently across the enterprise architecture, and it also fosters learning from other areas.

Finally, a key problem that you need to consider is the lifecycle management of a software component. Similar to the lifecycle of business modules, software components are designed, implemented, deployed and eventually retired. You need tools to appropriately manage software components.

Tools for Managing Software Components

Previously, I elaborated on the requirements for managing software components:

  • Handle interfaces with other components

  • Support the lifecycle of software components

Two known information technologies for managing software components are OSGi and the Service Component Architecture (SCA).

OSGi

OSGi is a framework for managing software components and their dependencies/interfaces to other components. It is developed by the OSGi Alliance. It originates from the Java world and is mostly suitable for Java components, although it has limited support for non-Java platforms. It considers the lifecycle of components by ensuring that all needed components are started when a component is started, and by being able to stop components during runtime. Furthermore, other components and their interfaces can be discovered at runtime. However, no deployment method for software components is part of the standard.
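
As a small illustration of the lifecycle handling mentioned above: a typical OSGi bundle provides an activator that the framework calls when the bundle is started or stopped, and services are registered and unregistered accordingly. The sketch below uses the standard OSGi framework API; the billing service itself is a made-up example.

import org.osgi.framework.BundleActivator;
import org.osgi.framework.BundleContext;
import org.osgi.framework.ServiceRegistration;

// Sketch of an OSGi bundle activator: the framework calls start()/stop()
// when the bundle's lifecycle state changes.
public class BillingActivator implements BundleActivator {

    // Hypothetical service interface and implementation, just for illustration.
    public interface BillingService { void bill(String customerId, double amount); }

    static class SimpleBillingService implements BillingService {
        public void bill(String customerId, double amount) {
            System.out.println("Billing " + customerId + ": " + amount);
        }
    }

    private ServiceRegistration<BillingService> registration;

    @Override
    public void start(BundleContext context) {
        // Publish the service so other bundles can discover it at runtime.
        registration = context.registerService(BillingService.class, new SimpleBillingService(), null);
    }

    @Override
    public void stop(BundleContext context) {
        // Withdraw the service when the bundle is stopped.
        registration.unregister();
    }
}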

Since Java can run on many different devices, OSGi is available for Android, iOS, embedded devices, personal computers and servers.

Unfortunately, tool support for linking OSGi and the business or information architecture is very limited. Furthermore, an automatic generation and deployment of OSGi components from your enterprise architecture does not exist at the moment. This makes it difficult to understand software components and their relations within your enterprise architecture.

Many popular software projects are based on OSGi, such as the Eclipse project.

Service Component Architecture (SCA)

The Service Component Architecture is a specification for describing software components, their interfaces and their dependencies. It is developed by members of the Organization for the Advancement of Structured Information Standards (OASIS). It does not depend on a specific programming platform, e.g. it supports Java and C++. It supports policies that govern a component, a set of components or their communication. However, SCA does not consider the software component lifecycle or how components are deployed exactly.
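
As a sketch of what an SCA component can look like in Java, the example below uses the OASIS SCA Java annotations to declare a component that offers one service and references another; the service interfaces and class names are hypothetical.

import org.oasisopen.sca.annotation.Reference;
import org.oasisopen.sca.annotation.Service;

// Sketch of an SCA component implementation in Java (OASIS SCA Java annotations).
// OrderService and CreditCheckService are hypothetical interfaces for illustration.
@Service(OrderComponent.OrderService.class)
public class OrderComponent implements OrderComponent.OrderService {

    public interface OrderService { String placeOrder(String customerId); }
    public interface CreditCheckService { boolean isCreditworthy(String customerId); }

    // The SCA runtime wires this reference to another component, possibly in another process.
    @Reference
    protected CreditCheckService creditCheck;

    @Override
    public String placeOrder(String customerId) {
        return creditCheck.isCreditworthy(customerId)
                ? "order accepted for " + customerId
                : "order rejected for " + customerId;
    }
}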

It is supported by many middleware frameworks, such as TIBCO Active Matrix or Oracle Fusion Middleware.

Similarly to OSGi, there is little tool support for linking SCA components and the business or information architecture. However, the SCA specification has a graphical modeling guideline, and some recent work describes how SCA components can be linked to the enterprise architecture via business processes. Since standardization bodies such as OASIS and the OMG (responsible for BPMN) maintain further modeling notations relevant for enterprise architecture, it can be expected that enterprise architecture tools will be adapted to provide support for linking different parts of the enterprise architecture.

Conclusion

Modularizing your business and designing software components are difficult tasks. Not many people understand the whole chain from business to software components. While enterprise architecture and modeling have become popular topics in research and practice, the whole tool chain from business to software components has not. There have been attempts to introduce model-driven architecture (MDA), but the supported models were mostly restricted to the Unified Modeling Language (UML), which is not very suitable for business modeling and can be very complex. Additionally, it does not take the software component lifecycle into account. Furthermore, the roles of the different stakeholders (e.g. business and IT) using these tools are unclear.

Nevertheless, new approaches based on the business process modeling notation and on frameworks for managing software components make me confident that this will change in the near future. Growing IT complexity in terms of communication and virtualization infrastructure will require software support for managing software components. Companies should embrace these new tools to gain competitive advantages in terms of agility and flexibility.

The Next Generation HTTP Protocol (HTTP 2.0) for Enterprise Applications

I will talk in this blog about the next generation HTTP protocol (HTTP 2.0) and put special emphasis on the implications for enterprise applications. Starting with the challenges and recent improvements to the HTTP protocol, such as WebSockets, I will describe the current state of the HTTP 2.0 specification. Finally, I will discuss implications for distributed web applications and for enterprise applications based on an enterprise service bus or complex event processing. The WebRTC protocol is seen as complementary to the HTTP 2.0 specification.

Introduction

The current version of the HTTP protocol is 1.1, and it is used by most web servers, proxies and browsers on the Internet. The main difference to HTTP 1.0 is that a connection can be reused, i.e. each request for a resource, such as an image or an HTML file, can use the same connection without the overhead of creating a new connection for each of them. This already shows the need to reduce the number of connections to avoid overloading firewalls or the network stack.

Furthermore, new protocols have emerged based on HTTP. Their goal is to support real-time applications, such as collaborative editing (cf. Apache Wave). Other examples can be found in the area of adaptive streaming, such as Apple Live Streaming. Clearly, these new killer applications required adaptations of the existing HTTP protocol standard. I will briefly describe these applications and explain why these adaptations are still somewhat flawed and require a new standard: HTTP 2.0.

Applications

Real-Time

Real-time applications require a permanent connection to a web server to push events or data to the server. Contrary to the standard request-response approach, the connection is never terminated. The underlying assumption is that the application does not know exactly how much data needs to be transferred to the server and when, but data transfers occur frequently. One example is collaborative editing: there, we need to transfer text additions, changes and removals to the server and ultimately to the other participants in the collaborative editing session. More advanced collaborative editors may also transfer other events, such as clicks or highlighting of text. For these real-time applications, the WebSocket standard has been developed. This standard enables a permanent connection for the aforementioned purposes. Basically, it is bootstrapped via HTTP 1.1, but afterwards does not transfer a lot of header information (see also the example of a standard HTTP request below). Mostly JavaScript applications leverage this standard. For compatibility reasons, the JavaScript libraries sock.js and socket.io/engine.io support a similar approach that also works with older browsers or proxies, based on various techniques, such as XHR polling.
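
Although WebSockets are mostly used from JavaScript in the browser, the standard is not tied to it. As a minimal sketch, the following Java 11+ client (java.net.http API) connects to a hypothetical collaborative editing endpoint, receives pushed events and sends an editing event over the same permanent connection.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.util.concurrent.CompletionStage;

// Minimal WebSocket client sketch (Java 11+). The endpoint URL is hypothetical.
public class EditingClient {

    public static void main(String[] args) throws Exception {
        WebSocket ws = HttpClient.newHttpClient()
                .newWebSocketBuilder()
                .buildAsync(URI.create("wss://example.org/collab-editing"), new WebSocket.Listener() {
                    @Override
                    public CompletionStage<?> onText(WebSocket webSocket, CharSequence data, boolean last) {
                        // Receive editing events pushed by the server over the permanent connection.
                        System.out.println("Received event: " + data);
                        webSocket.request(1); // ask for the next message
                        return null;
                    }
                })
                .join();

        // Push a small editing event to the server; no new connection is created for it.
        ws.sendText("{\"op\":\"insert\",\"pos\":42,\"text\":\"Hello\"}", true).join();
    }
}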

Media Streaming

Many popular live video streaming protocols are based on HTTP, such as Apple Live Streaming or MPEG-DASH. Basically, they offer a list of links to chunks (short media blocks of a few seconds in length) of the media stream. These chunks are then downloaded via HTTP.
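
The chunked download principle can be sketched in a few lines of Java (11+). The playlist URL and its one-URL-per-line format are simplified assumptions; real protocols such as HLS or MPEG-DASH define their own manifest formats.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: download the chunks listed in a (simplified) playlist over plain HTTP.
public class ChunkDownloader {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // 1. Fetch the playlist, assumed to contain one chunk URL per line.
        HttpResponse<String> playlist = client.send(
                HttpRequest.newBuilder(URI.create("https://example.org/stream/playlist.txt")).build(),
                HttpResponse.BodyHandlers.ofString());

        // 2. Download each chunk (a few seconds of media) as a separate HTTP request.
        for (String chunkUrl : playlist.body().split("\n")) {
            if (chunkUrl.isBlank()) continue;
            HttpResponse<byte[]> chunk = client.send(
                    HttpRequest.newBuilder(URI.create(chunkUrl.trim())).build(),
                    HttpResponse.BodyHandlers.ofByteArray());
            System.out.println("Downloaded " + chunk.body().length + " bytes from " + chunkUrl);
        }
    }
}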

Challenges

Although we have already identified some improvements with respect to HTTP 1.1, there are several issues with the current HTTP protocol:

  1. Communication is text-based

  2. No prioritization of data streams

Communication is Text-based

If we look at a standard request and response, we see that there is a lot of overhead because the communication is human-readable.

This can be seen from the following request and response (captured via Google Chrome):

HTTP Example Request (Assumption: connection to server www.wikipedia.org established)

GET / HTTP/1.1
Host: www.wikipedia.org
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.70 Safari/537.17
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-GB,en-US;q=0.8,en;q=0.6
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*

HTTP Example Response

Age:428
Cache-Control:s-maxage=3600, must-revalidate, max-age=0
Connection:keep-alive
Content-Encoding:gzip
Content-Length:11603
Content-Type:text/html; charset=utf-8
Date:Tue, 12 Feb 2013 22:36:12 GMT
Last-Modified:Mon, 11 Feb 2013 01:58:47 GMT
Server:Apache
Vary:Accept-Encoding
X-Cache:HIT from amssq38.esams.wikimedia.org
X-Cache:HIT from knsq24.knams.wikimedia.org
X-Cache-Lookup:HIT from knsq24.knams.wikimedia.org:80
X-Cache-Lookup:HIT from amssq38.esams.wikimedia.org:3128
X-Content-Type-Options:nosniff

[..] (content of the html page)

All attributes and values are human-readable. Thus, they cause a lot of overhead, since they are transferred each time we make a request and receive a response. Clearly, this is a problem for real-time or media streaming applications. Furthermore, there is overhead when parsing them for further processing.

No Prioritization of Data Streams

The underlying assumption of HTTP is that all data streams have the same priority. However, this is not true for all applications. Let us imagine you upload several large images for your collaborative editing application. At the same time you modify some text. This can mean that the other users in the collaborative editing session won’t see the updated text until the images are uploaded. Clearly this is an undesirable situation.

The Current State

If we imagine further applications, such as an enterprise service bus, which can be based on web services/SOAP services or REST services, then we can see a lot of room for improvement. Hence, some big vendors have developed proprietary HTTP extensions:

SPDY by Google has been used as a first draft for the new HTTP 2.0 protocol, which is in the process of being standardized by the IETF. Microsoft’s approach seems to be based on SPDY, but makes some parts of it optional to take into account mobile devices, which have limited resources and should, for example, only deal with encrypted content when necessary. Microsoft’s proposed extensions have been submitted to the IETF to be taken into account for HTTP 2.0.

The main improvements of SPDY compared to the HTTP 1.1 protocol are the following:

  • Reduce the overhead of HTTP by tokenizing headers, compressing data or removing unnecessary headers.
  • Single data channel for multiple requests (multiplexing). This has, to some extent, already been part of HTTP 1.1.
  • Prioritization of data (e.g. text editing events have higher priority than image upload events).
  • The server can push data to the client or suggest data for the client to request.
  • Further security features.

Luckily, SPDY is designed to be backward-compatible: it only changes the way HTTP data is transmitted, so existing applications do not need to be modified. The underlying assumption is that there is a translator in the middle (e.g. a proxy, or a translation provided directly by the application/library). More advanced features/applications have to implement HTTP 2.0 natively to control and leverage all features.
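
From today’s perspective, this negotiation idea can be illustrated with the HTTP client shipped with Java 11: the application asks for HTTP/2, and if the server or an intermediary only speaks HTTP/1.1 the client transparently falls back, so the application code does not change. This is just a small sketch, not tied to any specific SPDY implementation.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: request HTTP/2 and let the client fall back to HTTP/1.1 if necessary (Java 11+).
public class Http2Check {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_2) // preferred version; fallback is automatic
                .build();

        HttpRequest request = HttpRequest.newBuilder(URI.create("https://www.wikipedia.org/"))
                .GET()
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        // Shows which protocol version was actually negotiated with the server.
        System.out.println("Negotiated protocol: " + response.version());
        System.out.println("Status: " + response.statusCode());
    }
}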

At the moment the following browsers implement SPDY:

  • Google Chrome

  • Mozilla Firefox

  • Opera

  • Amazon Silk (Cloud Browser)

Some proxies and web servers, such as NGINX, support SPDY. Furthermore, the first popular web sites implement SPDY.

Implications

Clearly, the design of HTTP 2.0 has many interesting implications. Firstly, we note that HTTP 2.0 is not purely an application layer protocol as described in the OSI layer model; it also covers aspects of the session, presentation and transport layers.

Secondly, we can see that the prioritization of data streams is an extremely powerful feature, especially if we consider not only one application, but several applications integrated via an enterprise service bus. The enterprise service bus can also be compared with a kind of advanced HTTP 2.0 proxy. Imagine premium customers connected via a web site. The web site is integrated with the CRM and production systems via the enterprise service bus. Interactions with these customers via the web site are prioritized over interactions with basic customers.
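
The effect of such prioritization can be sketched with a simple priority queue in Java: interactions of premium customers are dispatched before those of basic customers. This only illustrates the idea on the application level; an enterprise service bus or an HTTP 2.0 proxy would prioritize on the message or stream level. All names are made up.

import java.util.Comparator;
import java.util.PriorityQueue;

// Simplified illustration of prioritizing interactions on an integration layer.
public class PrioritizedDispatcher {

    record Interaction(String customerId, boolean premium, String payload) {}

    public static void main(String[] args) {
        PriorityQueue<Interaction> queue = new PriorityQueue<>(
                Comparator.comparing((Interaction i) -> !i.premium())); // premium first

        queue.add(new Interaction("C-100", false, "order status request"));
        queue.add(new Interaction("C-007", true, "complaint"));
        queue.add(new Interaction("C-101", false, "newsletter signup"));

        while (!queue.isEmpty()) {
            Interaction next = queue.poll();
            System.out.println((next.premium() ? "[premium] " : "[basic]   ")
                    + next.customerId() + ": " + next.payload());
        }
    }
}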

Thirdly, complex event processing applications can leverage the speed improvements of HTTP 2.0 over HTTP 1.1 and analyze streams as well as correlate events across different streams and users.

HTTP 2.0 may have the potential to replace not only HTTP 1.1, but also other protocols in the future. Support and implementations by major software vendors demonstrate the seriousness of HTTP 2.0.

At the same time, we see further protocols emerging, such as the WebRTC protocol. WebRTC is a browser-based peer-to-peer protocol for voice/video chat and real-time applications; it is thus not server-based. Hence, it can be seen as complementary to the HTTP 2.0 protocol.

In the future, WebRTC and HTTP 2.0 may also be combined into an HTTP 3.0 to fulfill the vision of a truly decentralized Internet architecture.