The Lambda Architecture for Big Data in your Enterprise

In this blog post I will present the lambda architecture for Big Data. This architecture is about integrating historical Big Data with “live” streaming Big Data. Afterwards, I explain the concept of a large data lake in your enterprise, or among enterprises in a B2B scenario. Such a data lake – based on the lambda architecture – can replace a service-oriented architecture (SOA), because it is easier to implement and manage for large data volumes in a variety of formats. Hence, a plethora of use cases arises. Finally, I will discuss how this architecture can be implemented using various open source software technologies from the Hadoop ecosystem.

The Lambda Architecture

Big Data has become an increasingly popular topic over the last few years. Big Data is about processing large volumes of data in a variety of formats, taking into account both live streaming and historical data. One large computing cluster is used to store and process all of one or more companies’ data.

Internet companies, such as Google, Yahoo or Facebook, were driven by new business models for which existing technology was not suitable. This led to the development of new technologies known under the common umbrella of NoSQL. Furthermore, there has been the need to integrate them into a flexible, standardized architecture to enable Big Data. The lambda architecture is such an architecture and was coined recently by Nathan Marz and James Warren.

It has the following key features:

  • Standardized fault-tolerant distributed file system that spans the whole cluster – this file system is the base of the data lake that I will explain later.
  • A batch processing layer for processing large amounts of historical data stored in the computing cluster
  • A serving layer for providing fast access to results of batch processed data
  • A real-time processing layer (or “speed layer”) for “live” processing of data streams, such as sensor data or stock market data
  • A long term storage layer optimized for extremely cheap storage of data that is rarely used (e.g. for legal reasons). You usually do not find this layer in other articles describing the lambda architecture, but I think it is an important feature to highlight. Here you keep very old data (many years old) that you do not need in your day-to-day business – you can store it on very cheap hardware with a lot of disk space but much less computing power and memory capacity.

These features are not new and have been partly addressed by architectures known from other domains, such as Business Intelligence, Complex Event Processing, Data Warehousing or Master Data Management. However, the lambda architecture addresses them in the context of huge data volumes and a diversity of data formats (polyglot persistence), and integrates them all in one architecture.

The term “lambda” stems from the following function used for doing analytics in the context of Big Data:

query = λ(all data) = λ(live streaming data) * λ(historical data)

Basically it says that any analytics function λ combining live streaming data and historical data can be computed on systems implementing the lambda architecture. I will later discuss the implications of this for the implementation of the architecture.
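
To make this a little more concrete, here is a minimal sketch of the formula in plain Python: a query is answered by merging a precomputed batch view with a real-time view that covers only the data that arrived after the last batch run. All names (batch_view, realtime_view) are purely illustrative and not tied to any specific framework.

    # Batch view: precomputed by the batch layer over all historical data.
    batch_view = {"page_a": 10000, "page_b": 2500}

    # Real-time view: maintained by the speed layer for recent, not yet batch-processed data.
    realtime_view = {"page_a": 42, "page_c": 7}

    def query(key):
        """λ(all data): combine the historical and the live result."""
        return batch_view.get(key, 0) + realtime_view.get(key, 0)

    print(query("page_a"))  # 10042 = historical page views plus live page views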

The lambda architecture is illustrated in the following figure.

[Figure: Lambda architecture]

The lambda architecture provides data scientists with the means and tools to analyze any data occurring in the company; tools can easily be plugged into the architecture without requiring major implementation effort later on.

Machine learning components can autonomously leverage the lambda architecture to do prediction and automatically implement actions. This is known as predictive and prescriptive analytics.

Data Lake

One of the most interesting aspects of the lambda architecture is that you have a cluster with nearly unlimited storage and memory capacity. You can even have an in-memory database with a memory capacity on the terabyte to petabyte scale distributed over the whole cluster. Popular open source frameworks, such as Hadoop, allow you to use commodity hardware, so deploying such an architecture can be relatively cheap, and they already have built-in fault tolerance, so that developers do not need to deal with it themselves.

With such a large cluster you can create a big data lake in your company (see next figure). Basically, all your data ends up in this cluster, all applications – including the ones in the cloud – can share it via simple file system access mechanisms, and you can use the computing power of the whole cluster to do analysis. Needless to say, you save a lot of money, because you eliminate many redundant ETL processes, each of which would otherwise have to be made fault-proof and integrated with different systems. Modern Big Data architectures take care of this for you.

[Figure: Data lake]

Finally, exchanging data becomes much easier than in a service-oriented architecture (SOA), where you need to design interfaces and implement services – here every application simply accesses the distributed file system in the cluster.
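
As a minimal sketch of this idea, the following Python/PySpark snippet shows two applications sharing data purely via paths in the distributed file system; the namenode address and the paths are hypothetical.

    from pyspark import SparkContext

    sc = SparkContext(appName="data-lake-sharing")

    # Application A drops its data into the lake as plain files ...
    orders = sc.parallelize(["order1,49.99", "order2,15.00"])
    orders.saveAsTextFile("hdfs://namenode:8020/lake/sales/orders")

    # ... and application B (or an analytics job) simply reads the same path –
    # no service interface between the two applications is required.
    revenue = (sc.textFile("hdfs://namenode:8020/lake/sales/orders")
                 .map(lambda line: float(line.split(",")[1]))
                 .sum())
    print(revenue)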

Implementing a Lambda Architecture

There are several things to consider when you implement the lambda architecture. Firstly, you can choose from a variety of components to implement it. On the open source side, Apache Hadoop / Apache Spark is very popular and used by many companies, including large Internet companies such as Facebook or Google. You can also use other open source components, such as Apache Cassandra for the serving layer and Twitter Storm for stream processing. Additionally, you can use commercial tools, such as the SAP HANA Cloud Platform. Finally, you can put your lambda architecture completely on-premise, completely in the cloud (see my example with Amazon Elastic MapReduce, which partly implements a lambda architecture) or use some kind of hybrid model. In the following I will describe an implementation using Apache Hadoop and additional tools that integrate with Apache Hadoop.

Software Components

You can use the following components for implementing the lambda architecture.

  • Standardized fault-tolerant distributed file system: Hadoop Distributed File System (HDFS). You can also use other distributed file systems. The choice of file system is transparent to applications, i.e. they won’t need to use different APIs for different file systems. Most of the time you will be fine with HDFS, but cloud providers, such as Amazon, may implement their own file system that fits their infrastructure.
  • Batch processing layer: Here you can use Hadoop YARN, which is responsible for distributing Big Data analytics jobs, such as MapReduce jobs, across the cluster. YARN even allows you to “containerize” your jobs, i.e. define CPU, memory and network limitations across the big data cluster for a specific job. This allows you to do proper capacity management – one of the most important aspects of a lambda architecture. If you need in-memory batch processing, you should check out Apache Spark (a sketch of a simple batch-layer job follows after this list). If you want more generic job control, e.g. because you have other distributed applications around your cluster that are not based on the MapReduce paradigm, you can use Apache Mesos.
  • Serving layer: The serving layer provides fast access and advanced query mechanisms for the results of batch jobs. Here you can use typical Big Data databases and data warehouses, such as Apache HBase or Apache Shark (for in-memory access). You will probably have several different technologies here, according to the polyglot persistence NoSQL paradigm. They offer typical interfaces, such as JDBC or ODBC, to integrate with any application.
  • Real-time processing layer: Although Hadoop can process streaming data, most of the time you will choose a software component supporting complex event processing of live streaming data across your cluster, such as Apache Spark Streaming or Twitter Storm.
  • The long term persistence layer is mostly a hardware choice: here you need a lot of cheap hard disk space, e.g. by avoiding SSD flash drives, and only little computing power and memory. It is usually a separate cluster connected to the main cluster, and it leverages the fault tolerance features of HDFS, such as automated replication of data to several nodes and re-replication in case of node failures.
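
As a minimal sketch of a batch-layer job, the following PySpark snippet aggregates historical raw data from HDFS and writes a precomputed view; the paths are hypothetical, and in a real setup the result would typically be loaded into the serving layer (e.g. an HBase table) rather than written back as text files.

    from pyspark import SparkContext

    sc = SparkContext(appName="batch-layer-pageviews")

    # Aggregate the historical raw data: count views per page.
    views_per_page = (sc.textFile("hdfs:///lake/logs/raw/*")
                        .map(lambda line: (line.split("\t")[0], 1))
                        .reduceByKey(lambda a, b: a + b))

    # Store the precomputed batch view so the serving layer can answer queries quickly.
    views_per_page.saveAsTextFile("hdfs:///lake/views/pageviews_batch")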

Furthermore, there are a lot of other software components that build on the aforementioned core technologies, such as Apache Hive or Apache Shark, data warehouses for Hadoop, or Apache Oozie, a workflow tool for complex ETL processes distributed over your data lake.

As mentioned before, there is a wide variety of alternatives that you can use to implement the lambda architecture. The standardized fault-tolerant distributed file system is most of the time the base for everything and you can also gradually evolve your architecture and implement it using different components.

Delivery Pipeline

I briefly mentioned before that capacity management is an important part of the lambda architecture. You need to define how big data jobs are programmed and tested, as well as how they get into the cluster. I expect that in the future not only programmers, but also business people, such as data scientists, will need to load big data jobs into your cluster. This means you will need to (1) properly define your delivery pipeline, (2) implement and enforce proper capacity management and (3) have bullet-proof dependency management for different software versions in your cluster.

Luckily, by using Apache YARN or Apache Mesos together with cluster monitoring software, such as Ganglia, you can do proper capacity management.
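
The following sketch shows what per-job capacity limits can look like for a Spark job running on YARN; the queue name and the resource figures are illustrative only.

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("capacity-managed-job")
            .set("spark.yarn.queue", "analytics")    # YARN queue with its own share of the cluster
            .set("spark.executor.instances", "10")   # limit the number of containers
            .set("spark.executor.memory", "4g")      # memory per container
            .set("spark.executor.cores", "2"))       # CPU cores per container

    sc = SparkContext(conf=conf)
    # ... run the actual analytics job within these resource boundaries ...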

Recently, tools such as Docker, which use advanced virtualization features of the Linux kernel (cgroups), have emerged, making capacity management even easier and more flexible. These technologies also have built-in dependency management to avoid a library/versioning hell. Google has developed an open source scheduling system for them, called Kubernetes.

Combining Stream-Processing and Batch Data

One core goal of the lambda architecture is to integrate live streaming and batch processing. In fact, most of the recent articles on lambda architecture are just about providing both as software components. However, you will also need to integrate this on the query level, because complex event processing queries are a little bit different from batch processing queries.

Spark Streaming demonstrates how you can join historical data with stream-processed data in the same job.
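
A minimal Spark Streaming sketch of this pattern looks as follows; the host name, port, paths and record format are hypothetical.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="stream-batch-join")
    ssc = StreamingContext(sc, 10)  # process the stream in 10-second micro-batches

    # Historical view precomputed by the batch layer: (customer_id, lifetime_value)
    historical = (sc.textFile("hdfs:///lake/views/customer_value")
                    .map(lambda line: tuple(line.split(","))))

    # Live events arriving on a socket: "customer_id,amount"
    live = (ssc.socketTextStream("streamhost", 9999)
               .map(lambda line: tuple(line.split(","))))

    # Join each micro-batch of live events with the historical view.
    enriched = live.transform(lambda rdd: rdd.join(historical))
    enriched.pprint()

    ssc.start()
    ssc.awaitTermination()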

Hardware Components

Hardware considerations for a lambda architecture have – if at all – only been briefly discussed in most publications. Hardware planning is important for your cluster – we have already seen this with the long term storage layer. Furthermore, if you have a few very old machines in your big data cluster, this will affect all jobs running on it. You will need proper monitoring tools and rules to automatically identify these kinds of bottlenecks.

Conclusion

Once you have implemented the lambda architecture, you will need to teach everybody to use it. You will need to plan the migration of data storage for analytics from the individual systems to your data lake, i.e. your big data cluster. Keep in mind that the lambda architecture is about analytics. Although it is possible to include transactional systems (e.g. a MySQL Cluster), you will probably still use standard transactional databases for your individual ERP systems, CRM systems etc., from which you extract the data and put it into the cluster for analytics.

However, there are also other tools for doing distributed transactions, such as CloudTPS or, even more advanced, the Bitcoin transaction system. They may replace individual transactional databases in the future.

More and more companies are embarking on the journey of a standardized Big Data architecture each year. Most of them use open source technologies to gradually migrate towards one big data lake as it has been described here.

DevOps for your business? – About Uniting Development and Operations

In recent years, DevOps has become a term for a new paradigm of integrating and managing development as well as operations of software within and across organizations. In this blog entry I will describe what DevOps is and relate it to existing methodologies, such as agile development, and to organizational structures. Basically, DevOps is a broad term that summarizes a set of best practices, supported by a high degree of automation of all development and operational processes around the software delivery process, using advanced public and/or private cloud technologies. I will conclude with a brief summary of the impact of DevOps on Big Data applications.

The Situation

Many companies have separate development and operations departments. Both usually work under high pressure to deliver and operate applications for business processes of strategic importance.

In recent years it has become clear that the development department has to work more closely with internal and external customers to develop the right solutions, ones that the customer will accept. Agile methodologies have been advocated to manage problems given the uncertainty of the business environment in which future solutions will be deployed and/or the lack of understanding of customer or IT requirements. These agile methodologies broke with the paradigm of a clearly defined long-term process (e.g. the waterfall model) where the customer is only involved at the beginning and – often too late – at the end, so that mistakes were costly to correct.

At the same time, the operations department faced similar challenges. Given the uncertainties of the business environment, the customer asked for a lot of new services and IT infrastructure changes, but there was little understanding on the customer side of what effects these would have, which created high pressure on the operations department, which usually has a small budget to deal with such changes. Hence, a clearly structured and governed process was needed to handle customer requests. Thus popular IT Service Management frameworks, such as the Information Technology Infrastructure Library (ITIL), were born and used globally. Since critical business applications are operated, the operations department tends to be much more risk-averse and tries to avoid unknown technologies.

It can be observed that both departments needed and implemented different approaches to deal with their customers. This has led to the problem that both departments were not only divided from an organizational perspective, but also from a cultural one. For instance, the development department developed technology with little consultation of the operations department, and after some time simply threw a complex piece of software over the “fence” and told the operations department to operate it. However, since they did not collaborate much, there were a lot of (extremely) costly problems during software delivery and operations. For example, environments such as development or test environments did not match the required infrastructure. The newest updates needed to fulfill requirements could not be installed fast enough, so development was delayed. Operational staff required training, and this was not considered. Many more problems occurred, because there was a strong interdependency between both departments due to the software delivery process.

Recently, DevOps has been pushed to address this lack of collaboration between the two departments with respect to the software delivery process. Additionally, it addresses the challenges of new technologies, such as the cloud and software-defined infrastructures, which require strong development skills in both the development and the operations department.

Is DevOps a New Paradigm?

DevOps is not a clearly defined term and there is no reference model, such as ITIL, behind it. However, we can identify some common aspects that can be found in many different papers on the topic (cf. Gartner, this blog entry or this article):

  • A clearly defined and highly automated continuous software delivery pipeline across the different software environments and the development as well as operations departments, including:
    • Clearly described stakeholders with their roles and responsibilities.
    • Defined and measured key performance indicators, e.g. “after committing some software source code in the development environment it is fully tested as well as deployed in the production environment and can be used by business processes within one hour”.
    • Clearly defined environments, e.g. development environment, test environment, acceptance environment and production environment. This implies an unambiguous and idempotent description (i.e. a set of scripts) of how they are created, operated and destroyed, and what virtual resources (computation, memory and network) they require. This means you have to fully leverage private and public cloud technologies. Fully virtualized environments can be created by anyone using just a graphical browser interface (see next section).
    • Avoid reconfiguration of software for different environments. Environments should ideally be the same (e.g. same network addresses and same hardware).
    • Continuous integration of software components delivered by different teams.
    • Fully automated deployment procedure in different environments. Manual deployment requiring human intervention is forbidden.
    • Fully automated regression, integration and acceptance testing. Manual test activities should be reduced to nearly zero, because rapidly changing economic environments require rapid deployment of new solutions.
    • Test-driven development: develop the tests for parts of the software before you develop the software itself (see the small sketch after this list).
    • Deploy software to production incrementally, in small chunks and often (e.g. each week), as is required when using service-oriented architectures (SOA). This way you avoid big mistakes, and changes cannot have a catastrophic impact if they do not work. If you continue the traditional way of deploying large chunks of software with a lot of changes every few months, you will continue to pay the price of production outages or obstacles to your business processes.
    • Have a consistent fully automated monitoring approach for your software environments. Leverage machine learning techniques to predict software problems before they happen.
    • Allow each stakeholder (development, test, operations and customer) to use a current build of the software, deployed in a virtual environment, by just using a browser interface.
  • Integration of people from the operations department in the agile software development process. Operations has development skills and development has operation skills.
  • Integration of people from the development department in strategic IT operations processes
  • Clearly defined governance structures – have a sound program and project management
  • Integration of best practices, such as ITIL or agile methodologies
  • Fehlerkultur (error culture): Do not blame mistakes on each other, but solve them collaboratively. Finding out whose fault it was costs time and money and is usually not important – just solve them together as they occur. Be positive about errors and confident about handling them. Make error management part of your daily life. Each error is an opportunity to learn for all involved parties.
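
As a minimal sketch of test-driven development in Python (runnable e.g. with pytest), the tests below are written before the function they describe exists; the function name and business rule are purely illustrative.

    # Step 1: write the tests first (they fail, because shipping_cost() does not exist yet).
    def test_shipping_is_free_above_fifty():
        assert shipping_cost(order_value=60.0) == 0.0

    def test_shipping_costs_five_below_fifty():
        assert shipping_cost(order_value=20.0) == 5.0

    # Step 2: only then implement just enough code to make the tests pass.
    def shipping_cost(order_value):
        return 0.0 if order_value >= 50.0 else 5.0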

DevOps is NOT about flat organizations. Flat organizations only make sense if your organization has just one business area and your employee structure is relatively homogeneous. This does not imply that a strict hierarchical model is better, but you need to carefully combine control and agility.

Is DevOps right for your Organization?

Every organization should leverage and adapt DevOps practices, because they lead to significant benefits for all involved stakeholders (cf. also here). It makes sense for startups as well as large corporations. However, there are exceptions, such as research & development of disruptive new technologies. When researching and developing completely new technologies, where the impact on the infrastructure is completely unclear, or when you just want to explore very risky software that you are not certain you will use in the future, the overhead of a DevOps organization is too high. In my experience, there you have small teams, which are – on purpose – separated from the others to develop a new way of thinking and to use novel technologies. They often have to redo everything from scratch, potentially using very diverse software technologies in a short time frame. Nevertheless, they may still employ some DevOps aspects, such as full computing, memory and network virtualization provided by cloud technologies.

What tools can I use?

You will find a lot of tools for enabling DevOps in your organization. Indeed, tools are an important aspect, because you need a high degree of automation of previously manual activities and you will manage your environments using cloud virtualization technologies.

However, do not forget that it is also about organization and culture. Simply having the tools won’t help you much.

I present here only a few of the many tools that can be used.

Cloud

As I mentioned before, you want to create and manage software environments quickly and in a highly automated fashion. Each environment should ideally be identical to avoid errors due to reconfiguration of software. This means they should have the same underlying software, configuration, (virtualized) hardware and the same network configuration (including the same IP addresses). Large-scale public cloud providers, such as Amazon EC2 or Google Compute, already have technologies that make this feasible. If you prefer a private cloud, you can use OpenStack, a Linux-based cloud computing distribution offering functionality similar to Amazon EC2. It provides a web interface for creating new environments in a browser.

Additionally, you can use tools, such as Vagrant, to automate creation and management of software environments.

Continuous Integration

Continuous integration supports the integration of different software components from different teams. It automatically builds and tests complete applications every time a new piece of source code is committed to the version management system, such as Git. Popular tools are Jenkins or Travis.

Besides normal unit and integration testing, one should look at acceptance testing tools, such as Cucumber. There, acceptance tests can be described by business users in (nearly) natural language. This makes them repeatable and reliable.
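
As a minimal sketch of this approach in Python, the behave library (a Python counterpart to Cucumber) maps natural-language steps to step definitions; the feature text and step names below are hypothetical.

    # Feature file (written by business users, stored separately as cart.feature):
    #   Scenario: Adding articles to the cart
    #     Given an empty shopping cart
    #     When the customer adds 2 articles
    #     Then the cart contains 2 articles

    from behave import given, when, then

    @given("an empty shopping cart")
    def step_empty_cart(context):
        context.cart = []

    @when("the customer adds {count:d} articles")
    def step_add_articles(context, count):
        context.cart.extend(["article"] * count)

    @then("the cart contains {count:d} articles")
    def step_check_cart(context, count):
        assert len(context.cart) == count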

Continuous Deployment

Deployment should be automated and repeatable, independent of the environment. This means no manual configuration or manual deployment steps. You have to develop scripts that ensure that the target system is in the desired state after a deployment – independent of the state it is currently in. Popular tools, such as Puppet, Chef or Vagrant, help you with this task.
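
The following Python sketch illustrates the underlying idempotency principle (not the API of any of these tools): describe the desired state and only act if the system deviates from it, so running the script twice changes nothing. The path and content are illustrative.

    import os

    CONFIG_PATH = "/etc/myapp/app.conf"
    DESIRED_CONFIG = "port=8080\nlog_level=INFO\n"

    def ensure_config_file(path, desired_content):
        """Bring the config file into the desired state, whatever its current state."""
        current = None
        if os.path.exists(path):
            with open(path) as f:
                current = f.read()
        if current == desired_content:
            return "unchanged"            # already in the desired state – do nothing
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            f.write(desired_content)
        return "changed"

    print(ensure_config_file(CONFIG_PATH, DESIRED_CONFIG))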

Recently, web interfaces have become popular, so you can do deployment of complex applications just using your browser (cf. Ubuntu Juju).

Monitoring

Basically, your monitoring infrastructure has to collect all the messages from the various applications deployed in an environment and be able to act on them. Usually there are a lot of messages in various text formats. This means you have to find the right tools to collect and analyze a large amount of data.

You should not only present results (e.g. critical conditions), but also be able to handle them automatically (e.g. repair broken applications). Amazon OpsWorks is one product that can do this for applications deployed on the Amazon EC2 cloud.
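
A very small Python sketch of such automated handling is shown below: new log lines are scanned for a critical condition and a repair action is triggered instead of only raising an alert; the log path, error pattern and restart command are hypothetical.

    import subprocess
    import time

    LOG_FILE = "/var/log/myapp/app.log"
    offset = 0  # remember how far we have already read

    while True:
        with open(LOG_FILE) as f:
            f.seek(offset)
            new_lines = f.readlines()
            offset = f.tell()
        if any("OutOfMemoryError" in line for line in new_lines):
            # take automatic action instead of only reporting the problem
            subprocess.call(["systemctl", "restart", "myapp"])
        time.sleep(60)  # check once per minute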

An interesting application is the Netflix ChaosMonkey. It is an excellent example of a tool supporting the Fehlerkultur (error culture) mentioned before. Basically, it switches off machines on which your software is deployed at random, but only within a certain time frame (e.g. from 9 to 17 o’clock). This means errors will be detected more easily and can be handled while all employees of the company are there. Hence, you will have fewer errors on Sunday mornings, when it is difficult to get the right people to work on a problem. It should be noted that Netflix, a media streaming service, requires strict quality of service and cannot allow, for example, the streaming of media to the customer to be interrupted or disturbed. Nevertheless, the ChaosMonkey switches off machines in their production environment.
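
The core idea can be sketched in a few lines of Python: randomly pick a machine and switch it off, but only within the agreed time window; terminate_instance() and the instance list are hypothetical placeholders for your cloud provider’s API.

    import random
    from datetime import datetime

    INSTANCES = ["app-server-1", "app-server-2", "app-server-3"]

    def terminate_instance(instance_id):
        print("terminating", instance_id)  # call the real cloud API here

    def chaos_round():
        now = datetime.now()
        if 9 <= now.hour < 17:             # only during working hours (9-17 o'clock)
            terminate_instance(random.choice(INSTANCES))

    chaos_round()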

The Netflix ConformityMonkey and JanitorMonkey check if the state of the systems is still acceptable or degraded. If it is not acceptable any more then the instances are automatically switched off and rebooted, so an acceptable state is available all the time. Furthermore, they switch off unused instances to reduce costs.

Recently, predictive software maintenance has become a hot topic, where you predict, given the environmental conditions, when your application will fail or slow down – before the event has taken place.

DevOps and Big Data

Big Data is about processing a large amount of data of different nature (e.g. structured or unstructured) in acceptable time. There is a trend to leverage big data to enable new, sophisticated statistical models of complex real-world dependencies. For instance, you may want to predict what customers will buy next, or when company cars are likely to fail – even before the event happens.

Having DevOps is mandatory for Big Data. You will have a limited set of resources, very business-critical applications, and change will come often, e.g. new prediction models. You may also need to do experiments in production environments, for instance to evaluate new machine learning technologies involving real, live business data.

Furthermore, another challenge that many people are not yet aware of is that business users will actually develop code and deploy it to production. For example, marketing specialists will develop and test new models using R. Similarly, hardware engineers will develop and deploy prediction algorithms for the reliability of hardware assets. This code will become part of existing critical business applications.

For example, imagine you own a bus company. Your mechanics have a good understanding of the probability distributions of hardware errors, gas consumption etc. They will develop a model in a statistical programming language, such as R, to predict failures and maintenance needs of your buses based on sensor data as well as maintenance reports. The output of this is used by the scheduling application developed by your development department to avoid delays in bus service delivery. Similarly, a statistical marketing program developed by your marketing department will predict how many customers to expect, and when, based on historical data but also on current events (e.g. a soccer match, news or tweets). This can also be included in the scheduling application. It is not very realistic that the marketing department will ask the development department to implement a statistical model proposed by them – the overhead is simply too large for experimental Big Data applications.

In fact, this already happened a long time ago in the finance industry, where business analysts created macros in Excel/Access or other office software to support the calculation of their complex financial statistical models for creating ever more complex financial products. The problem there was that you had a variety of software somewhere in the business with critical impact, no versioning or backup, and unknown dependencies to other software or data. This is obviously bad and can even lead to bankruptcy.

How this can be handled has not yet been the subject of extensive research or experimentation.