Optimizing our deployment pipeline — Part 1

Minimizing feedback loops and bottlenecks

When I started working for Highsnobiety five months ago, I was asked to look into our deployment and quality assurance (QA) process, because we only had one QA environment where we merged all our features to review them before going live. This often led to unforeseen behaviour in some test cases. But let me start from the very beginning.

Today’s article is the first in a short series. As soon as each one is published, you’ll be able to find all articles in this series linked here:

We work in two-week sprint cycles, applying a loosely adapted Scrum methodology. Our legacy codebase has a weekly deployment rhythm, while our other services can be deployed continuously. In each sprint planning, we assign a rotating deployment responsibility to two people, our “deploy peers”. They manage the whole process, from publishing the code to monitoring the site and systems afterwards. Code that is deployed must have passed the quality criteria defined in our Definition of Done (DoD), which includes the following rules:

  • Unit and integration test coverage above 80% (we’re still very tolerant)
  • The code is reviewed and approved by a peer
  • All changes are tested on relevant browsers

Workflow challenges

The development of a new feature is usually sparked by a stakeholder’s idea. Once the Product Owner has accepted the idea, the user story is put in a ticket in the backlog and eventually included in an upcoming sprint. As probably every front-end developer knows, things are often not as easy as they seem. In our case, we have two separate feedback loops: DoD and QA.

Figure 1: A pretty standard development process

Everything we develop is measured against the user story and its acceptance criteria. As of now, we process DoD and QA sequentially, because we do not want unreviewed and untested code to be merged. This sequential processing prevents us from picking up speed. Let’s think one step further: What would happen if all developers were either busy or on sick leave? QA would then have to wait for a developer’s merge, which presents us with our first bottleneck. Alternatively, if we didn’t have to merge our code first, code review and QA review could happen in parallel.

Challenge #1: Code review is blocking the review process

One of our main challenges is that non-developers cannot run new features on their own machines. At many companies, you have one QA environment to which you can deploy specific tickets that need to be reviewed by non-developers. But wait: if we have one environment to rule them all, what happens when changes collide with each other? For instance, suppose you have deployed a bugfix for the main navigation, but another feature that also affects the main navigation has been deployed to QA afterwards. Your bugfix needs to go live ASAP, and you have already told the stakeholder that they can have a look on QA. Bummer: your changes are gone and you need to re-merge your code.

Figure 2: What our review process looks like

Challenge #2: One environment for reviewing different features

Additionally, we do not keep track of the features deployed on QA, which can lead to unexpected behaviour. CSS from feature A can positively affect feature B on QA. When deploying tested feature B to production, it might break because feature A has not been deployed yet.

Every feature still needs to pass reviews by design, product, and the stakeholder. They represent our end user and the business side. Usually, at least one party proposes amendments. I won’t list all the details about how those changes are applied, but you can probably imagine how time-consuming this process is.

Everybody has their own daily work, so it might take a while before people can review the story we’re working on. Our Product Owners, for example, are juggling many projects at once and attend regular meetings with various stakeholders. Therefore, you might have to wait several days before receiving feedback. Having only one QA environment can mean that changes we merged a few hours or days ago may be gone by the time the reviewer looks for them.

Challenge #3: Lack of knowledge about features merged onto QA

An additional point to highlight is that I would like to focus on my work instead of reminding people to review my feature. At the moment, I have to speak to each participant in the QA process and tell them that a new feature has been merged onto QA and can be reviewed. The process of initiating a review should work automatically.

Challenge #4: Asynchronous and manual communication between reviewers and developer

Every modification proposal we receive restarts the development loop from the beginning, going all the way back to the developer’s IDE and passing DoD again.

TL;DR:

  • Code review is blocking the review process
  • One environment for reviewing different features
  • Lack of knowledge about features merged onto QA
  • Asynchronous and manual communication between reviewers and developer

The natural question that follows is: Why don’t we automate parts of this process? Why don’t we have an easier way of getting things done? We probably could.

Our Solution

Let’s find a solution to the problems at hand, starting with our first problem: Having only one environment for testing.

After development is done, a process should be in place that deploys the current state of the application to an independent server accessible by the reviewers. We call it the Preview Service. Every time a pull request is created or updated (i.e. a development iteration is done), a link is automatically posted in the corresponding ticket. Everyone following this ticket (stakeholder, product, design) will be notified about the update.

https://xyz.highsnobiety.com/{namespace}/{repository}/{container-tag}
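As a sketch, the link-posting step could look like the Python fragment below. Only the URL pattern above is taken from our setup; the `post_comment` callback and the example namespace, repository, and tag values are hypothetical stand-ins for whatever CI system and ticket API you actually integrate with.

```python
# Sketch: build the preview URL for a branch's container image and post it
# to the ticket whenever CI reports that a pull request was created/updated.
# `post_comment` abstracts the ticket system (e.g. a REST call in real life).

BASE = "https://xyz.highsnobiety.com"

def build_preview_url(namespace: str, repository: str, container_tag: str) -> str:
    """Compose the preview URL following the pattern above."""
    return f"{BASE}/{namespace}/{repository}/{container_tag}"

def on_pull_request_updated(ticket_id: str, namespace: str, repository: str,
                            container_tag: str, post_comment) -> str:
    """Called by CI on every pull-request update.

    Posting the link as a ticket comment notifies everyone following the
    ticket (stakeholder, product, design) without any manual communication.
    """
    url = build_preview_url(namespace, repository, container_tag)
    post_comment(ticket_id, f"Preview ready: {url}")
    return url
```

The callback keeps the sketch independent of any concrete ticket system; swapping in Jira, Linear, or anything else only changes `post_comment`.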

Access is granted upon entering credentials, so no outsider can sneak a peek at our upcoming features. When a preview link is visited for the first time, the Preview Service launches an instance based on the Docker image of the current branch. This can take up to a few minutes, depending on the container’s boot-up time. While the instance boots, the user receives real-time feedback on its status.

Once the temporary preview environment is up and running, the loading page redirects the user to the service’s entry point. For every new version, the previous instance is removed and a fresh one is launched. Having the Preview Service in place will solve our problem of code review blocking progress on QA. In addition, no more manual communication is needed, because every party is notified automatically. This is also beneficial for the QA of more advanced features, which is carried out by an external team through crowdsourcing. They need up to three days to test a new version of our software. At the moment, this freezes the environment completely, blocking it for other QA purposes. In the future, this would no longer be an issue.
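To make those lifecycle rules concrete, here is a toy Python model of the behaviour just described: the first visit launches an instance from the branch’s Docker image, repeat visits redirect to the running instance, and a new revision replaces the previous one. The class name and the in-memory registry are illustrative only; a real implementation would drive Docker or the Kubernetes API instead.

```python
class PreviewService:
    """Toy model: at most one running preview per (namespace, repository)."""

    def __init__(self):
        self._running = {}  # (namespace, repository) -> container tag

    def visit(self, namespace: str, repository: str, tag: str) -> str:
        key = (namespace, repository)
        current = self._running.get(key)
        if current == tag:
            # Instance already up: skip the loading page entirely.
            return f"redirect to {tag}"
        if current is not None:
            # A new revision replaces the previous instance.
            self._stop(key)
        self._launch(key, tag)
        return f"booting {tag}: show loading page with live status"

    def _launch(self, key, tag):
        # Placeholder for `docker run` / creating a Kubernetes workload.
        self._running[key] = tag

    def _stop(self, key):
        # Placeholder for tearing the previous container down.
        del self._running[key]
```

Keeping the replace-on-new-revision rule inside the service means reviewers never see a stale mix of revisions, which is exactly the failure mode of the shared QA environment.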

Figure 3: What it should look like after implementing Preview Service

TL;DR: We will create a service that spins up a temporary, isolated environment displaying just one feature at a time. Communication happens inside the ticket, where the links to the Preview Service are posted automatically. This will increase the number of deployments, since features can be tested in isolation. It should also raise developers’ motivation, because the feedback loops feel less time-consuming.

What’s next?

In part two of this series, we will dive deeper into the actual implementation of our solution with Docker, Kubernetes, and Elixir. Our main focus will be on web services because they have one communication protocol (HTTP) in common.

Make sure to subscribe to Highsnobiety Tech to not miss upcoming parts of this series!