Monitoring CI flakes
Team Glean’s Alyssa Burlton explains how our CI pipeline works, and how we keep it moving.3 min read Published: 17 Aug 2021
Our CI Pipeline
At Glean, the CI (Continuous Integration) pipeline for our core web app does quite a lot, including:
- Running unit tests for the frontend and backend services
- Building all the corresponding artifacts and publishing them as Docker images
- Deploying these images into a Kubernetes cluster and ensuring they’re serving traffic ok
- Running “end-to-end” tests against the resulting environment via Cypress
This gives us a great level of confidence when we’re developing Glean, but doing all of these things also increases our risk of the pipeline becoming too slow or flakey to be anything other than an immense source of frustration for developers.
How we’ve dealt with it
As the codebase and pipeline have grown and matured, we’ve found the need to keep returning to it now and then in concentrated bursts of effort to keep it healthy. We were doing this on a fairly ad-hoc basis - pretty much when enough complaints bubbled to the surface for us to feel that the pipeline had become “too painful” again, which was usually much later than we would ideally have spotted the underlying issues.
After the latest batch of fixes, we began to think about how we might collect meaningful data on this flakiness so we weren’t just fixing in the dark all the time. What would an acceptable percentage of build flakes be for us? What about average build times? And so our “CI Monitor” was born.
How it works
The first question we needed to answer was: how do we measure flakiness?
Clearly it can’t be as crude as the number of failures across all branches, because this will include a bunch of failures on branches that had actual problems with them.
We could restrict it to looking at just the failure rate of main, but this would also be imperfect - we’d still get some noise from the occasional bad merge, and we also wouldn’t get a particularly large sample size over the course of a normal week.
Instead of the above approaches, we decided to create a dedicated ci-monitor branch, which would build on a cron schedule throughout the week to give us lots of juicy data. It was a simple tweak to our Jenkinsfile to set up the cron behaviour:
Next, we introduced a git tag called latest, to be applied to the Glean repository whenever a build of main passes. This essentially marks the last stable commit, which the ci-monitor can safely update itself from without worrying about bad merges. The bash responsible for creating and updating the branch is below, which we have running in a Github Actions workflow:
We have another Github Actions job that runs on a Friday evening to delete the branch, which allows some automated cleanup to clear stuff down from the week’s build - published Docker images, AWS (Amazon Web Services) resources and the like.
Now we had our guinea pig branch, the next step was to get the raw stats somewhere that could be processed. We ended up using Prometheus for this (specifically, a fork of https://github.com/lovoo/jenkins_exporter), capturing simple stats about each build including the duration, pass/fail status and the stage it failed on if applicable.
Finally, it was over to Python to knock together some simple code to process the data each week and post it into Slack. I’ve included some snippets of this below to give the general idea:
How it’s going
And that’s it! The team now gets a weekly message in Slack with reliable data on how our Glean pipeline is faring. If we see the pass rate or build times start to worsen, we can nip it in the bud by taking a look at the failed builds and fixing up anything that’s regressed.
Could you be our next dev?
At Team Glean, we're always on the lookout for talented people to join us.
To learn more about working with us, and to see the latest opportunities, follow the link below!
More from Tech BlogView All
Glean vs. the universe
In which developer Chris Moore persuasively argues that a recent Glean glitch was caused by… the sun?
Migrating a Flutter app to Null Safety
In this post, Team Glean’s Cameron talks us through the benefits and drawbacks of the ‘null safety’ upgrade to Dart.
How we discuss engineering improvements at Glean
Team Glean's Gwen talks us through Engineering's regular 'Entmoot' and how it could help your dev team work smarter