Getting to Green - the challenges of CI and CD in a large organisation

This is a paraphrasing of a talk I gave at the DevOps Usergroup Malaysia Meetup in July 2012.

The idea of the talk was to present some thoughts on how to move from manual deployment to automated deployment in a larger organisation which has not historically used automation techniques.

I am the CIO at the iProperty Group; we operate sites in 5 countries – Malaysia, Singapore, Hong Kong, Macau and Indonesia. In total we have around 50 IT staff (30 of whom are in Malaysia). We have a mixture of different stacks in use within the organisation (3 of our major sites are written using .Net and run on Windows, 2 are written in PHP and run on Linux).

Our hosting is currently split between co-located facilities (one in KL, one in Jakarta) and Amazon Web Services (in Singapore). We are gradually moving more towards AWS.

We are investing a lot in the tools and processes that support our environment, including AWS, Puppet and Continuous Integration (Jenkins).

Ultimately the goal is to have true push-button deployment where any developer can deploy their own code into production with confidence and ease. As with a lot of other companies, we look towards Etsy as a model of how this can be done: there, developers deploy code on their very first day on the job.

By working on automation of our deployment pipeline, we know we can increase confidence in the products we’re pushing out. We want to improve the reliability and repeatability of our releases. The aim is to get to a state where we can deploy many times a day - this will enable us to increase innovation and reduce frustration.

Of course, it’s never quite that simple and deployment is often one of the overlooked parts of the product development process. Developers assume that once the code is written their job is done.

I’ve had past experience where there was such uncertainty about the quality of software that a total change freeze was put into place for months at a time. This had two consequences: the first was that developers became frustrated as products they had put lots of work into were left to rot without going live; the second was that, once the freeze was eventually lifted, deploying all the backed-up changes was a massive task.

At iProperty we have relied heavily on manual work to get code live. This makes deployments fragile and risky. Moving from one webserver to multiple servers without any automation means that we have problems with synchronisation of code versions, files being missed, problems surfacing in production that weren’t in development and more.

The more you scale, the more automation becomes a necessity. We are currently moving sites to AWS; in that environment you have to assume that an instance will become unresponsive at some point, and the easiest way to handle that is to build a new instance to replace it. Without automated deployment of the application that’s a tricky task.

So how do you go from manual, fragile deployments to robust automated deployments? Our approach has been to take a small project first and prove out the processes and concepts there before taking them to larger, more complicated applications.

The testbed for us is a site that’s relatively new - it’s been live for about 9 months and it has a team of 4 developers. The bonus is that it’s also running in a Linux environment and the tools are more proven than their equivalents on Windows.

To kickstart the project and to help with technical leadership, we asked ThoughtWorks to come in and help for a short engagement. Among their many strengths, they have a lot of experience in deployment automation, continuous delivery and DevOps.

As we worked through the project, what seemed like a simple site on the surface turned out to have some hidden surprises; probably the biggest of these was the number of places where the application was making assumptions about the environment (for example, that content stored on the local disk would always be available). We had a bit of work to do to move user-generated content over to Amazon S3 before we could properly set up a test environment.
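One way to remove that kind of local-disk assumption is to have the application talk to a small storage interface instead of the filesystem directly. The sketch below is illustrative only and is not our actual code: ContentStore, LocalDiskStore, InMemoryStore and save_avatar are hypothetical names, and the in-memory class merely stands in for a shared object store such as S3 so that the example runs without any external service.

```python
import os
from typing import Dict, Protocol


class ContentStore(Protocol):
    """Hypothetical interface the application codes against."""

    def save(self, key: str, data: bytes) -> None: ...

    def load(self, key: str) -> bytes: ...


class LocalDiskStore:
    """Keeps content on one server's disk (the original assumption)."""

    def __init__(self, root: str) -> None:
        self.root = root

    def save(self, key: str, data: bytes) -> None:
        path = os.path.join(self.root, key)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as handle:
            handle.write(data)

    def load(self, key: str) -> bytes:
        with open(os.path.join(self.root, key), "rb") as handle:
            return handle.read()


class InMemoryStore:
    """Stand-in for a shared object store; a real one would call the S3 API."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def save(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def load(self, key: str) -> bytes:
        return self._objects[key]


def save_avatar(store: ContentStore, user_id: int, image: bytes) -> str:
    """Application code only sees the interface, never the disk."""
    key = f"avatars/{user_id}.png"
    store.save(key, image)
    return key
```

With this shape, a test environment and production can each be given a different store without the calling code changing.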

Once we’d worked out the architecture, the next step was to get Jenkins running as a build and deployment server. We now have 2 Jenkins builds - one is fairly standard (pulls changes from git, runs through the build steps and produces a release tarball as an artefact); the second is manually triggered and is used to deploy the code to the web and application servers. The deployment process works by grabbing the tarball from the standard build, pushing it out to a central deployment host and asking that to then copy the code to all of the webservers. In this way, the only place we need to know about which webservers are currently in the loop is on that central deployment machine.
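The fan-out step on the central deployment host could be sketched roughly as follows. This is a minimal illustration under assumptions, not the script we run: DeployPlan, build_commands and deploy are hypothetical names, the host names and paths in the usage note are made up, and it assumes the deployment host can reach each webserver over SSH.

```python
import os
import shlex
import subprocess
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class DeployPlan:
    tarball: str           # release tarball already copied to the deployment host
    release_dir: str       # directory on each webserver to unpack into
    webservers: List[str]  # the only place the server list is kept


def build_commands(plan: DeployPlan) -> List[List[str]]:
    """Return the scp/ssh commands needed to push the tarball to every webserver."""
    commands: List[List[str]] = []
    remote_tarball = "/tmp/" + os.path.basename(plan.tarball)
    for host in plan.webservers:
        # Copy the release tarball to the webserver.
        commands.append(["scp", plan.tarball, f"{host}:{remote_tarball}"])
        # Unpack it into the release directory on that webserver.
        unpack = (
            f"mkdir -p {shlex.quote(plan.release_dir)} && "
            f"tar -xzf {shlex.quote(remote_tarball)} -C {shlex.quote(plan.release_dir)}"
        )
        commands.append(["ssh", host, unpack])
    return commands


def deploy(plan: DeployPlan, run: Callable = subprocess.run) -> None:
    """Run each command in order, stopping at the first failure."""
    for command in build_commands(plan):
        run(command, check=True)
```

For example, deploy(DeployPlan("/srv/releases/site-42.tar.gz", "/var/www/site", ["web1", "web2"])) would copy and unpack the release on two hypothetical hosts. Keeping the server list inside the plan mirrors the point above: only the deployment machine needs to know which webservers are in the loop.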

The other part of the project has been to start to build some automated tests so that we can deploy with more confidence. We’re developing these from the outside-in - starting with functional tests (using Selenium) that prove the most critical features of the site. Ideally we’d have had these in place from the start.
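To give a flavour of what an outside-in check looks like, here is a small sketch. Note that it swaps Selenium for plain HTTP requests from the Python standard library so that it is self-contained; it only checks that pages load and contain expected text, whereas Selenium drives a real browser. The paths, marker text and function names are all hypothetical.

```python
from typing import Callable, Dict, List
from urllib.request import urlopen

# Hypothetical critical pages, each mapped to text the page must contain.
CRITICAL_PAGES: Dict[str, str] = {
    "/": "Search",
    "/listings": "Results",
}


def fetch(url: str) -> str:
    """Download a page body over HTTP."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def failed_checks(
    base_url: str,
    pages: Dict[str, str],
    get: Callable[[str], str] = fetch,
) -> List[str]:
    """Return a description of every critical page that failed its check."""
    failures: List[str] = []
    for path, expected in pages.items():
        try:
            body = get(base_url.rstrip("/") + path)
        except OSError as exc:  # urllib's URLError is a subclass of OSError
            failures.append(f"{path}: {exc}")
            continue
        if expected not in body:
            failures.append(f"{path}: missing {expected!r}")
    return failures
```

A deployment job could run failed_checks against the freshly deployed site and refuse to continue if the returned list is not empty.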

So what have we learnt? Firstly, we’ve learnt that automated deployment is achievable, but you need to start small and prove it out in a contained way first. Secondly, we’ve learnt that there will always be details that tend to trip you up, so don’t rush it. Thirdly, I’d recommend getting help if you need it - even if you think you understand the task, having an outsider’s perspective gives you balance and focus. And finally, testing is crucial - do it early and save yourself the pain later!

As a CIO, a lot of my job is about managing risk to the business from IT. Pushing out unstable code, not being able to release products due to fragile processes and not being able to scale to meet demand are some of the key risks that we face. Automating deployment can help with all of these.

The talk generated some interesting discussion points, particularly about the path forward. The next big challenge is to prove out the same tools and approach with Windows web servers.