If you’ve tried Kubernetes distributions like OpenShift but are still struggling to realise the benefits of DevOps and Continuous Delivery, you are not alone. We’re finding that most companies (including large banks) that buy these platforms aren’t achieving rapid time-to-value, zero downtime, and regular deployments to live. This seems crazy because, on the face of it, they give you so much out-of-the-box tooling. For example, OpenShift includes installers for all major cloud platforms, monitoring, container management and logging. It even gives you automation to create container images and deploy them through pipelines, plus templates to create services, build configurations, and deployment configurations. It should all be there for you, and yet we still see companies with seemingly mature processes only delivering every three or four months. What’s gone wrong?
The software we write has value only when it’s being used in production. Until then, features are just “work in progress” as they await peer review, QA, promotion between environments, acceptance by the business, or a release date. They have zero value because they generate no benefit to users: they’re not bringing in new revenue, improving customer service, or saving time. What we should be doing is getting this work live quickly without compromising quality. What slows us down is friction, and we hit it everywhere from development through to release. Friction, waste, work in progress, delays: different philosophies have different names for it, but we’re essentially talking about anything that slows down the path to live. Often this is because of handoffs between teams, with manual steps and poor communication; but if we’re ever going to get to Continuous Delivery, we must first understand where friction occurs.
Let’s start by looking at the software value chain: the set of steps required to build production software. For each step, there will be actors, activities, time taken, artefacts produced and pain points. Everyone in the team (from Devs to the Ops teams monitoring live service) has a different perspective on this: a Scrum Master thinks about process; a Product Owner wants to know when features will be available to customers; QAs are constantly thinking about how defects might be escaping their test approach; Architects need to know that agreed SLAs (e.g. performance and security) have been met; DevOps teams invest in the build pipelines. But once we bring the whole value chain together, we can see that there is a shared goal: to get units of value through the system as fast as possible without compromising quality of service – no-one wants to introduce bugs or outages as the price for that speed. Indeed, companies that achieve Continuous Delivery have lower change failure rates, higher quality and fewer outages (see Nicole Forsgren’s book Accelerate: The Science of Lean Software and DevOps). People who do this well keep batch sizes small, automate manual processes, and look for continual improvement through fast feedback. Let’s look at how we can apply these principles so that our OpenShift delivery becomes fast and frictionless.
Efficient Delivery of Features rather than Code
What if it wasn’t just Jira, but our release tools, our quality gates, and even our development approach that were entirely focussed on features rather than code artefacts? We would notice several things:
- We progress entire features, rather than code or containers, through the pipelines together with the (environment) configuration required to run them. This means we never get environment/configuration drift – so there is no time wasted waiting to fix an environment after we promote code that requires new resources.
- Giving all actors (Devs, Release Teams, QEs, Product Owners and Architects, for example) transparency into what’s running where – not just through release notes, but by tagging features and tracking their promotion through the pipeline – means our releases become better understood, more frequent, smaller and less risky. (We’ve built a UI that shows which features are running where.)
- We can also begin to see how long (and where) features are waiting in the pipelines so that we can take action to fix the bottlenecks.
- Developers (and QAs) tag each Git commit with the Jira or Trello card it implements – it’s easy to do and is enforced through pre-commit or pre-receive hooks (see the sketch after this list). This helps us string together all the elements (code, tests, resources, environment info) required to support each feature.
- If something goes wrong, we revert an environment back to a given snapshot, including both the configuration and the complete set of features running within it – see Reversion. This feature-based, rather than code-based, approach ensures that we revert to something meaningful and consistent.
- Finally, we use Product Streams to manage the development and deployment of multiple software branches, e.g. rolling out fixes for Release 1 of a product whilst Release 2 is still in development.
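As a concrete illustration of the commit-tagging point above, here is a minimal sketch of what a server-side pre-receive hook enforcing the card reference might look like. It’s written in Python for readability; the card pattern (e.g. PROJ-123), messages and exact rules are hypothetical, not the precise implementation we ship.

```python
#!/usr/bin/env python3
"""Illustrative pre-receive hook: reject pushes whose commits lack a card reference."""
import re
import subprocess
import sys

# Assumed card format, e.g. PROJ-123; a real hook would use the team's own pattern.
CARD_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")

def commit_subjects(old: str, new: str):
    """Return (sha, subject) for each commit in the pushed range."""
    if old == "0" * 40:
        # New branch: a real hook would exclude commits already reachable from other refs.
        rev_range = new
    else:
        rev_range = f"{old}..{new}"
    out = subprocess.run(
        ["git", "log", "--format=%H %s", rev_range],
        capture_output=True, text=True, check=True,
    ).stdout
    subjects = []
    for line in out.splitlines():
        sha, _, subject = line.partition(" ")
        subjects.append((sha, subject))
    return subjects

def main() -> int:
    failed = False
    # pre-receive reads "<old-sha> <new-sha> <ref>" lines on stdin
    for line in sys.stdin:
        old, new, _ref = line.split()
        for sha, subject in commit_subjects(old, new):
            if not CARD_PATTERN.search(subject):
                print(f"rejected {sha[:8]}: no card reference in '{subject}'")
                failed = True
    return 1 if failed else 0

if __name__ == "__main__":
    sys.exit(main())
```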
This is a different way of looking at development, and we’ve built a UI to support it, backed by Jenkins pipelines and test automation. The first reaction of most teams is to look for a microservices view, e.g. containers and versions within the pipeline. We give you that, but also a view showing which features are running where: features are what we discuss at Sprint Planning, in Backlog Refinement sessions, at Daily Stand-ups and at Sprint Reviews, so why not look at the delivery pipelines and environments in the same way?
The Feature View also allows users with the correct permissions to push sets of features through the pipeline. Why sets? Because we want to make sure that any given environment is consistent and that all the services and features within it will work together. Only once all tests have passed do we promote the entire set of features to the next stage of the pipeline.
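To make the “sets” behaviour concrete, the promotion gate works roughly like the sketch below: nothing in the set is promoted unless every feature in it has passed its tests. The names here are purely illustrative, not Boost’s actual API.

```python
from dataclasses import dataclass

@dataclass
class Feature:
    card_id: str        # the Jira/Trello card the commits are tagged with
    tests_passed: bool  # aggregated result of the feature's quality-gate tests

def promote_feature_set(features: list[Feature], target_stage: str) -> bool:
    """Promote the whole set to the next stage, or nothing at all (illustrative only)."""
    blocked = [f.card_id for f in features if not f.tests_passed]
    if blocked:
        print(f"Not promoting to {target_stage}; failing features: {blocked}")
        return False
    print(f"Promoting {[f.card_id for f in features]} to {target_stage}")
    return True
```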
Now that we have a feature-based, rather than code-based, perspective on delivery, let’s see how we apply those core DevOps principles.
Small batch sizes
Thinking of (and managing) software as features rather than code fragments or containers helps us deliver small and manageable batches: Product Owners create small, independent user stories, and our code is their implementation. Indeed, the entire Scrum process is based on small, potentially shippable increments of code: we simply foster and encourage that through our feature-driven CICD processes. Because increments flow through so quickly, we need fast feedback so that we can spot problems before the next increment is ready to progress. We also need tools to monitor and roll back releases (with zero downtime) if we do hit a problem. This is OK – it’s a price we pay for speed, but one that we can easily automate away through our framework.
Fast Feedback
When there is a problem, it’s good to know about it fast so we can fix it. Take software development: Devs make mistakes and introduce bugs which, if missed by unit tests, are hopefully picked up by QEs. The longer the gap between introducing and identifying a bug, the longer it takes for the Dev to get back up to speed before finding a fix; and if the bug report is incomplete, it takes longer still. We therefore include the following innovations to help eliminate friction and encourage fast feedback:
- Devs add hooks to their microservices so that the QEs can plan and link their tests (e.g. Gherkin) at the beginning of each Sprint. We won’t go into detail here about Data Loaders, Selenium Page Objects, and Data Objects, but we have a framework that lets QEs write their tests at the same time as the Devs write the code, so our Sprint Definition of Done includes all our BDD testing.
- We add OpenTracing libraries to all our code so that we can trace interactions between collaborating microservices. This means that when QEs find a bug they can report not only the component that throws the error, but also the trace and context of that bug, e.g. service A called service B with these data (see the sketch after this list). We’ve found this means bugs are fixed much more quickly.
- We make sure containers are secure and valid before they get to QEs. Because we’ve automated the creation of environments (see later), we can create and tear down container isolation environments which exist (for seconds or minutes) to test each container as it is built. For example, we run automated vulnerability tests to ensure that the container is valid before tearing down the isolation environment and progressing that container to the next stage. This near-instant feedback allows the Devs to fix problems as they occur and avoids wasting QE time.
- Monitoring. In addition to the Jaeger monitoring of OpenTracing events, we integrate with standard tools such as Prometheus to check the health of our OpenShift deployments and to alert us if a release is causing trouble. We can then roll back a release (using our Blue/Green deployment strategy to ensure zero downtime) to flip live back to a stable point.
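To show what the tracing bullet above means in practice, here is a minimal sketch of OpenTracing-style context propagation between two services, using the generic `opentracing` Python API with its default no-op tracer. The service and tag names are illustrative; a real deployment would register a concrete tracer (e.g. Jaeger) and propagate the context over HTTP.

```python
import opentracing
from opentracing.propagation import Format

tracer = opentracing.tracer  # no-op by default; a real setup registers e.g. a Jaeger tracer

def service_a_call():
    """Service A: start a span and inject its context into the outgoing request headers."""
    with tracer.start_active_span("service-a.handle-request") as scope:
        headers = {}
        tracer.inject(scope.span.context, Format.HTTP_HEADERS, headers)
        # In a real service this would be an HTTP call; here we pass the headers directly.
        service_b_handle(headers, payload={"orderId": 42})

def service_b_handle(headers, payload):
    """Service B: extract the caller's context so its spans join the same trace."""
    parent_ctx = tracer.extract(Format.HTTP_HEADERS, headers)
    with tracer.start_active_span("service-b.process", child_of=parent_ctx) as scope:
        scope.span.set_tag("payload.orderId", payload["orderId"])
        # If this step fails, the bug report can include the full trace:
        # which service called which, and with what data.

if __name__ == "__main__":
    service_a_call()
```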
Continual Improvement
You can’t future-proof your development process; you can only make it easy to adapt and improve. We do this by using a declarative approach to environments, pipelines and quality gates, so you can change your process as your context evolves. For example, if you decide you need a new step in the pipeline – perhaps you’ve added a Security Test environment – we update the pipeline declaration (it’s a YAML file) with the new environment name, the resources it needs, the quality gate that guards access to it, the role that can promote code into it, and how it links to the stages before and after it. This takes perhaps a minute (assuming we already have Ansible roles for any new resources) and the Boost automation does the rest, including creating the OpenShift environments and artefacts, wiring in the Jenkins jobs and updating the UI console. Our process is therefore adaptive not only to your organisation and the product streams you are delivering, but also to changes within those product streams.
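As a purely illustrative sketch of that declaration (the field names below are hypothetical, not the actual Boost schema), adding a Security Test stage might look something like this:

```yaml
# Hypothetical fragment of a pipeline declaration: adding a Security Test stage.
stages:
  - name: security-test
    after: system-test            # the stage it follows in the pipeline
    before: pre-production        # the stage it feeds into
    resources:
      - postgres-small            # provisioned via an existing Ansible role
    quality_gate:
      tests: [owasp-zap-baseline] # the gate guarding access to this stage
      require_all_passed: true
    promotion:
      allowed_roles: [release-manager]   # who may promote code into it
```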
Conclusion
Continuous Delivery with OpenShift should be easy to implement – this is what Boost gives you from Day 1 – but also easy to evolve. In some ways, it’s the Day 2 requirements that kill you, and it’s by using a declarative approach to define your delivery process that we are able to keep pace with change. There’s a lot of automation behind the scenes with Boost, but this is so that everyone in your organisation can get on with delivering value to your customers as quickly and efficiently as possible.