DevOps for your business? – About Uniting Development and Operations

In recent years, DevOps has become a term for a new paradigm of integrating and managing the development and operation of software within and across organizations. In this blog entry I describe what DevOps is and relate it to existing methodologies, such as agile development, and to organizational structures. Basically, DevOps is a broad term that summarizes a set of best practices supported by a high degree of automation of all development and operational processes around software delivery, using advanced public and/or private cloud technologies. I conclude with a brief summary of the impact of DevOps on Big Data applications.

The Situation

Many companies have separate development and operations departments. Both usually work under high pressure to deliver and operate applications for business processes of strategic importance.

In recent years it has become clear that the development department has to work more closely with internal and external customers to develop solutions that the customer can accept. Agile methodologies have been advocated to manage the problems that arise from the uncertainty of the business environment in which future solutions will be deployed and/or from a lack of understanding of customer or IT requirements. These agile methodologies broke with the paradigm of a clearly defined long-term process (e.g. the waterfall model) in which the customer is involved only at the beginning and, often too late, at the end, so that mistakes were costly to correct.

At the same time, the operations department faced challenges similar to those of the development department. Given the uncertainties of the business environment, customers asked for many new services and IT infrastructure changes, but had little understanding of their effects. This created high pressure on the operations department, which usually has a low budget to deal with these changes. Hence, a clearly structured and governed process was needed to handle customer requests. Thus popular IT Service Management frameworks, such as the Information Technology Infrastructure Library (ITIL), were born and are used globally. Since it operates critical business applications, the operations department tends to be much more risk-averse and tries to avoid unknown technologies.

It can be observed that the two departments needed and implemented different approaches to dealing with customers. This has led to the problem that they were divided not only from an organizational perspective, but also from a cultural one. For instance, the development department developed technology with little consultation of the operations department, and after some time it just threw a complex piece of software over the “fence” and told the operations department to operate it. However, since they collaborated little, there have been many (extremely) costly problems during software delivery and operations. For example, the different environments, such as development or test environments, did not match the required infrastructure. The newest updates needed to fulfill requirements could not be installed fast enough, so development was delayed. Operational staff required training, and this was not considered. Many more problems occurred because the software delivery process creates a strong interdependency between the two departments.

Recently, DevOps has been promoted to address this lack of collaboration between the two departments with respect to the software delivery process. Additionally, it addresses the challenges of new technologies, such as the cloud and software-defined infrastructures, which require strong development skills in both the development and the operations department.

DevOps: a New Paradigm?

DevOps is not a clearly defined term and there is no reference model, such as ITIL, behind it. However, we can identify some common aspects that can be found in many different papers on the topic (cf. Gartner, this blog entry or this article):

  • A clearly defined as well as highly automated continuous software delivery pipeline across different software environments and the development as well as operations departments
    • Describe involved stakeholders with roles and responsibilities.
    • Defined and measured key performance indicators, e.g. “after committing some software source code in the development environment it is fully tested as well as deployed in the production environment and can be used by business processes within one hour”.
    • Clearly defined environments, e.g. development environment, test environment, acceptance environment and production environment. This implies an unambiguous and idempotent description (i.e. a set of scripts) of how they are created, operated and destroyed, and of what virtual resources (computation, memory and network) they require. This means you have to fully leverage private and public cloud technologies. Fully virtualized environments can be created by anyone using just a graphical browser interface (see the tools section below).
    • Avoid reconfiguring software for different environments. Ideally, environments should be identical (e.g. same network addresses and same hardware).
    • Continuous integration of software components delivered by different teams.
    • Fully automated deployment procedure in different environments. Manual deployment requiring human intervention is forbidden.
    • Fully automated regression, integration and acceptance testing. Manual test activities should be reduced to nearly zero, because rapidly changing economic environments require rapid deployment of new solutions.
    • Test-Driven development: develop the test of parts of software before you develop the software itself.
    • Deploy software to production incrementally, in small chunks and often (e.g. each week), as is required when using Service-oriented Architectures (SOA). This way you avoid mistakes, and changes that do not work cannot have a catastrophic impact. If you continue the traditional way of deploying large chunks of software with many changes every few months, you will continue to pay the price of production outages or obstacles to your business processes.
    • Have a consistent fully automated monitoring approach for your software environments. Leverage machine learning techniques to predict software problems before they happen.
    • Allow each stakeholder (development, test, operations and customer) to use a current build of the software deployed in a virtual environment by just using a browser interface
  • Integration of people from the operations department in the agile software development process. Operations has development skills and development has operation skills.
  • Integration of people from the development department in strategic IT operations processes
  • Clearly defined governance structures – have a sound program and project management
  • Integration of best practices, such as ITIL or agile methodologies
  • Fehlerkultur (error culture): Do not blame mistakes on each other, but solve them collaboratively. Finding out whose fault it was costs time and money and is usually not important – just solve them together as they occur. Be positive about errors and confident about handling them. Make error management part of your daily life. Each error is an opportunity to learn for all involved parties.
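
The example KPI from the list above (a commit reaches production within one hour) only helps if it is actually measured. The following is a minimal sketch of how that could be done, assuming you can obtain a commit timestamp and a production deployment timestamp for each delivery. The names `LEAD_TIME_TARGET`, `lead_time` and `kpi_report` are hypothetical and not taken from any particular tool.

```python
from datetime import datetime, timedelta

# Assumption: the target comes from the example KPI in the text (one hour).
LEAD_TIME_TARGET = timedelta(hours=1)


def lead_time(committed_at: datetime, deployed_at: datetime) -> timedelta:
    """Time between a commit and its deployment to production."""
    if deployed_at < committed_at:
        raise ValueError("deployment cannot precede the commit")
    return deployed_at - committed_at


def kpi_report(deliveries: list[tuple[datetime, datetime]]) -> dict:
    """Summarize how many (commit, deployment) pairs met the target."""
    times = [lead_time(committed, deployed) for committed, deployed in deliveries]
    met = sum(1 for t in times if t <= LEAD_TIME_TARGET)
    return {
        "deliveries": len(times),
        "met_target": met,
        "worst": max(times) if times else None,
    }
```

Feeding such a report from the delivery pipeline itself, rather than from manually maintained lists, keeps the KPI consistent with the automation goal described above.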

DevOps is NOT about flat organizations. Flat organizations only make sense if your organization has just one business area and your employee structure is relatively homogeneous. This does not imply that a strict hierarchical model is better, but you need to carefully combine control and agility.

Is DevOps right for your Organization?

Every organization should leverage and adapt DevOps practices, because they lead to significant benefits for all involved stakeholders (cf. also here). This holds for startups as well as large corporations. However, there are exceptions, such as research & development of disruptive new technologies. When you research and develop completely new technologies whose impact on the infrastructure is unclear, or when you just want to explore very risky software that you are not certain to use in the future, the overhead of a DevOps organization is too high. In my experience, such work is done by small teams that are deliberately separated from the others to develop a new way of thinking and of using novel technologies. They often have to redo everything from scratch in a short time frame, potentially using very diverse software technologies. Nevertheless, they may still employ some DevOps aspects, such as full computing, memory and network virtualization provided by cloud technologies.

What tools can I use?

You will find many tools for enabling DevOps in your organization. Indeed, tools are an important aspect, because you need a high degree of automation of previously manual activities and you will manage your environments using cloud virtualization technologies.

However, do not forget that it is also about organization and culture. Simply having the tools won’t help you much.

I present here only a few of the many tools that can be used.

Cloud

As I mentioned before, you want to create and manage software environments fast and in a highly automated fashion. Ideally, all environments should be identical to avoid errors due to reconfiguration of software. This means they should have the same underlying software, configuration, (virtualized) hardware and network configuration (including the same IP addresses). Large-scale public cloud providers, such as Amazon EC2 or Google Compute Engine, already have technologies that make this feasible. If you prefer a private cloud, you can use OpenStack, an open-source cloud computing platform that offers functionality similar to Amazon EC2. It offers a web interface for creating new environments in a browser.

Additionally, you can use tools, such as Vagrant, to automate creation and management of software environments.
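
To illustrate what "identical environments" means in practice, here is a minimal sketch that compares two environment descriptions and reports every setting that differs. It assumes each environment is described as a flat dictionary of settings. The function name `environment_drift` and the example keys and values are hypothetical and do not correspond to any cloud provider's API.

```python
def environment_drift(reference: dict, candidate: dict) -> dict:
    """Return {setting: (reference_value, candidate_value)} for each difference."""
    missing = object()  # sentinel for settings present in only one description
    drift = {}
    for key in sorted(set(reference) | set(candidate)):
        ref_val = reference.get(key, missing)
        cand_val = candidate.get(key, missing)
        if ref_val != cand_val:
            drift[key] = (
                None if ref_val is missing else ref_val,
                None if cand_val is missing else cand_val,
            )
    return drift


# Hypothetical descriptions of two environments.
production = {"os_image": "ubuntu-22.04", "ip_address": "10.0.0.10", "memory_mb": 4096}
test = {"os_image": "ubuntu-22.04", "ip_address": "10.0.0.10", "memory_mb": 2048}

print(environment_drift(production, test))
```

An empty result means the two descriptions match, so software should not need to be reconfigured when it moves from one environment to the other.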

Continuous Integration

Continuous integration supports the integration of software components from different teams. It automatically builds and tests complete applications every time new source code is committed to the version management system, such as Git. Popular tools are Jenkins and Travis CI.
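
The core idea of such a pipeline can be sketched in a few lines: run the stages in a fixed order on every commit and stop at the first failure. This is only an illustration under simplifying assumptions, not how Jenkins or Travis are implemented. The names `Stage` and `run_pipeline` are hypothetical, and each stage is modelled as a function that returns `True` on success.

```python
from typing import Callable

# A stage is a name plus a callable that reports success or failure.
Stage = tuple[str, Callable[[], bool]]


def run_pipeline(stages: list[Stage]) -> tuple[bool, list[str]]:
    """Run stages in order and stop at the first failing stage."""
    log = []
    for name, step in stages:
        ok = step()
        log.append(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            return False, log  # later stages are skipped on purpose
    return True, log
```

Stopping early matters: a build that fails its unit tests should never reach the deployment stage.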

Besides normal unit and integration testing, one should look at acceptance testing tools, such as Cucumber. There, acceptance tests can be described by business users in (nearly) natural language. This makes them repeatable and reliable.

Continuous Deployment

Deployment should be automated and repeatable, independent of the environment. This means no manual configuration or manual deployment steps. You have to develop scripts that ensure that the target system is in the desired state after a deployment, independent of the state it is currently in. Popular tools, such as Puppet, Chef or Vagrant, help you with this task.
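
The "desired state" idea can be sketched as follows: describe the target state, compute the steps needed to get there from whatever state the system is in, and make sure a second run finds nothing left to do. This is a simplified illustration that assumes the state is just a mapping from package name to version. The functions `plan_actions` and `converge` are hypothetical and are not the API of Puppet, Chef or Vagrant.

```python
def plan_actions(current: dict[str, str], desired: dict[str, str]) -> list[str]:
    """Compute the steps needed to move from the current to the desired state."""
    actions = []
    for package, version in sorted(desired.items()):
        if package not in current:
            actions.append(f"install {package}=={version}")
        elif current[package] != version:
            actions.append(f"upgrade {package} {current[package]} -> {version}")
    for package in sorted(set(current) - set(desired)):
        actions.append(f"remove {package}")
    return actions


def converge(current: dict[str, str], desired: dict[str, str]) -> tuple[dict[str, str], list[str]]:
    """Return the new state and the actions taken (applying is simulated here)."""
    return dict(desired), plan_actions(current, desired)


desired = {"app": "1.1", "db": "5.0"}
state, first_run = converge({"app": "1.0", "old": "0.1"}, desired)
state, second_run = converge(state, desired)
print(first_run)
print(second_run)  # empty list: running the deployment again changes nothing
```

The second run producing no actions is what makes the deployment idempotent and therefore safe to repeat in any environment.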

Recently, web interfaces have become popular, so you can deploy complex applications using just your browser (cf. Ubuntu Juju).

Monitoring

Basically, your monitoring infrastructure has to collect all the messages from the various applications deployed in an environment and be able to act on them. Usually there are a lot of messages in various text formats. This means you have to find the right tools to collect and analyze a large amount of data.

You should not only present results (e.g. critical conditions), but also be able to automatically handle them (e.g. repair broken applications). Amazon OpsWorks is one product that can do this for applications deployed on the Amazon EC2 cloud.
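
As a small illustration of collecting messages in different text formats and deciding whether to act, the following sketch classifies log lines by a severity keyword and flags the environment when critical messages appear. The keyword list, the function names and the sample lines are assumptions made for this example. Real monitoring tools use far more robust parsing.

```python
import re
from collections import Counter

# Assumption: severity is recognizable by one of these keywords, in any case.
SEVERITY = re.compile(r"\b(CRITICAL|ERROR|WARNING|INFO)\b", re.IGNORECASE)


def classify(line: str) -> str:
    """Return the first severity keyword found in the line, or UNKNOWN."""
    match = SEVERITY.search(line)
    return match.group(1).upper() if match else "UNKNOWN"


def summarize(lines: list[str]) -> Counter:
    """Count messages per severity across differently formatted lines."""
    return Counter(classify(line) for line in lines)


def needs_action(summary: Counter, max_critical: int = 0) -> bool:
    """True when more critical messages occurred than tolerated."""
    return summary["CRITICAL"] > max_critical


# Hypothetical messages from three applications with different formats.
sample = [
    "2024-01-01 12:00:00 INFO service started",
    "[error] connection refused",
    'app=billing level=critical msg="disk full"',
    "heartbeat",
]
print(summarize(sample), needs_action(summarize(sample)))
```

In a real setup, `needs_action` would trigger an automated repair step instead of just returning a flag.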

An interesting application is the Netflix ChaosMonkey. It is an excellent example of a tool supporting the Fehlerkultur (error culture) mentioned before. Basically, it switches off, at random, machines on which your software is deployed, but only in a certain time frame (e.g. from 9:00 to 17:00). This means errors are detected more easily and can be handled while all employees of the company are there. Hence, you will have no or fewer errors on Sunday morning, when it is difficult to get the right people to work on a problem. It should be noted that Netflix, a media streaming service, requires strict quality of service and cannot afford that, for example, streaming of media to the customer is interrupted or disturbed. Nevertheless, the ChaosMonkey switches off machines in their production environment.
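
The selection logic described above can be sketched in a few lines. This is not Netflix's implementation, only an illustration of the idea. The function names are hypothetical, and the restriction to Monday through Friday is my assumption, derived from the Sunday example in the text. Actually switching off the chosen machine is left out.

```python
import random
from datetime import datetime
from typing import Optional


def in_business_hours(now: datetime, start_hour: int = 9, end_hour: int = 17) -> bool:
    """True on Monday to Friday from start_hour (inclusive) to end_hour (exclusive)."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victim(instances: list[str], now: datetime, rng: random.Random) -> Optional[str]:
    """Pick one instance to switch off, or None outside business hours."""
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(instances)
```

Passing the random generator and the current time as parameters keeps the sketch testable, for example with `pick_victim(["web-1", "web-2"], datetime.now(), random.Random())`.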

The Netflix ConformityMonkey and JanitorMonkey check whether the state of the systems is still acceptable or has degraded. If it is no longer acceptable, the instances are automatically switched off and rebooted, so that an acceptable state is available all the time. Furthermore, they switch off unused instances to reduce costs.

Recently, predictive software maintenance has become a hot topic: given the environmental conditions, you predict when your application will fail or slow down before the event takes place.
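
A very simple form of such a prediction is to fit a straight line to a monitored metric and extrapolate when it will cross a critical threshold. The sketch below does this with an ordinary least-squares fit. It assumes a roughly linear trend, which is a strong simplification compared to the machine learning techniques mentioned earlier. The function names and the example metric (disk usage in percent) are hypothetical.

```python
from typing import Optional


def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least-squares fit, returning (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def hours_until_threshold(hours: list[float], usage: list[float], threshold: float) -> Optional[float]:
    """Extrapolate the trend; None if the metric is not rising."""
    slope, intercept = fit_line(hours, usage)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - hours[-1])


# Hypothetical measurements: disk usage in percent, sampled once per hour.
print(hours_until_threshold([0, 1, 2, 3], [50, 55, 60, 65], threshold=90))
```

A result of `None` means the metric is flat or falling, so no threshold crossing is predicted from the trend alone.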

DevOps and Big Data

Big Data is about processing large amounts of data of different kinds (e.g. structured or unstructured) in acceptable time. There is a trend to leverage big data to enable new sophisticated statistical models of complex real-world dependencies. For instance, you want to predict what customers want to buy next or when, for example, company cars are likely to fail, even before the event happens.

Having DevOps is mandatory for Big Data. You will have a limited set of resources, you have very business-critical applications, and change will come often, e.g. new prediction models. You may also need to do experiments in production environments. For instance, you may want to evaluate new machine learning technologies involving real live business data.

Furthermore, another challenge of which many people are not yet aware is that business users will actually develop code and deploy it in production. For example, marketing specialists will develop and test new models using R. Similarly, hardware engineers will develop and deploy prediction algorithms for the reliability of hardware assets. This code will become part of existing critical business applications.

For example, suppose you own a bus company. Your mechanics have a good understanding of the probability distributions for hardware errors, gas consumption etc. They will develop a model in a statistical programming language, such as R, to predict failures and maintenance needs of your buses based on sensor data as well as maintenance reports. The output of this model is used by the scheduling application developed by your development department to avoid delays in bus service delivery. Similarly, a statistical marketing program developed by your marketing department will predict how many customers to expect and when, based on historical data but also on current events (e.g. soccer match, news or tweets). This can also be included in the scheduling application. It is not very realistic that the marketing department will ask the development department to implement a statistical model it has proposed. The overhead is simply too large for experimental Big Data applications.

In fact, this already happened a long time ago in the finance industry, where business analysts created macros in Excel/Access or other office software to support the calculation of their complex financial statistical models for creating more and more complex financial products. The problem there was that a variety of software existed somewhere in the business with critical impact, but with no versioning or backup and with unknown dependencies on other software or data. This is obviously bad and can even lead to bankruptcy.

How this can be handled has not yet been the subject of extensive research or experiments.