Effective CI/CD Strategies for Updating GitOps Repositories
With Git at the heart of GitOps, it is natural to consider adopting the core CI/CD concepts of continuous integration, delivery, and deployment to validate changes to a GitOps repository.
CI/CD for GitOps still needs a strong focus on automation for validation and rollout steps, but dealing with infrastructure significantly affects the dynamics of these steps to the point where a few revisions are in order. The following sections build upon the lessons from my previous article on GitOps repository design and highlight the central adaptations for GitOps CI/CD pipelines.
Lesson #1: Infrastructure is not an app. A declarative way of thinking
The challenge with CI/CD for a GitOps repository is that there is no self-contained binary or package to be transferred over to a runtime environment. The repository has configuration files and instructions for a third-party agent to effect changes in the final medium: the infrastructure.
The starting point for an application is a relatively well-defined runtime, such as a container image or an operating system sitting atop a VM or bare-metal server. For infrastructure, the starting point may be an empty cloud account or a cloud account that already has a few instances of the services covered in the GitOps repository.
In that sense, GitOps practices work best with declarative approaches, with repository contents describing what the deployment should look like, then letting the underlying agents in the infrastructure do their best to match those definitions.
Declarative approaches fundamentally affect the guarantees we usually expect from a CI pipeline.
For example, let’s assume a GitOps repository containing a few folders representing a fleet of Kubernetes clusters:
- A “common” folder containing shared settings across all clusters.
- A separate folder (or configuration file) for each cluster containing only the settings that are exclusive to the cluster.
When someone modifies a shared setting in that “common” folder, applying that change to an environment changes the effective desired state for all clusters. Short of actually applying the change to every one of those clusters, there is no way to validate its full extent.
Hence, the goal of the CI pipeline is to validate the declarative statements against resources as similar as possible to those in the eventual production environment, a topic covered in lesson #4.
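To make the reach of a “common” change concrete, here is a minimal sketch that computes the effective desired state of each cluster by layering its own settings over the shared ones, then lists the clusters affected by a change to the shared settings. The function names, cluster names, and settings are all hypothetical, and the sketch assumes the simplest possible override rule (cluster settings win, nested dictionaries are merged); real tooling may apply different rules.

```python
from copy import deepcopy


def merge_settings(common: dict, overrides: dict) -> dict:
    """Return the common settings with cluster-specific overrides layered on top."""
    merged = deepcopy(common)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections instead of replacing them wholesale.
            merged[key] = merge_settings(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def effective_states(common: dict, clusters: dict) -> dict:
    """Compute the effective desired state of every cluster in the fleet."""
    return {name: merge_settings(common, own) for name, own in clusters.items()}


def affected_clusters(old_common: dict, new_common: dict, clusters: dict) -> list:
    """List the clusters whose effective desired state changes."""
    before = effective_states(old_common, clusters)
    after = effective_states(new_common, clusters)
    return sorted(name for name in clusters if before[name] != after[name])


# Hypothetical fleet, for illustration only.
clusters = {
    "dev-1": {"nodes": 3},
    "prod-1": {"nodes": 12, "logging": {"level": "warn"}},
}
old_common = {"logging": {"level": "info", "retention_days": 7}}
new_common = {"logging": {"level": "info", "retention_days": 30}}

print(affected_clusters(old_common, new_common, clusters))
```

A one-line edit to the shared retention setting changes the desired state of both clusters, while a change to the shared logging level would only reach the cluster that does not override it. A check like this can show which clusters a change touches, but it cannot show how those clusters will react to it.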
Lesson #2: Enforce small, independent changes
This lesson affects the continuous integration portion of a CI/CD pipeline. Continuous integration aims at merging frequent code iterations to the main development branch. The foundation of continuous integration is the concept of small, independent code changes validated by comprehensive test automation.
If pull requests accumulate behind sluggish validation sequences, feature branches begin to overlap in a cacophony of merge conflicts, requiring extra validation cycles.
Small, independent changes are an often overlooked aspect of continuous integration. It is easier to break this rule while doing application development, especially when the test automation achieves a certain level of functional coverage that can paper over the occasional pull request containing more than one independent change.
With GitOps, extensive changes to the repository mean multiple parts of the infrastructure changing at once, possibly asynchronously. Test cases are rarely that thorough, and even gradual rollout approaches may fail to deal with the blind spots caused by events taking place out of order. In that sense, with GitOps, it is crucial to formalize the practice of making small changes as part of the CI/CD pipeline and avoid accidentally terraforming a deployment into a whole new world.
As one example of a formalized practice, consider flagging extensive changes in pull requests. All major Git providers support APIs to add various “status” messages to a Git commit, so this is the perfect place to validate whether a pull request is “small.” For example, if the repository design calls for a parent folder for each class of devices, the automation may flag pull requests containing changes to more than one class of devices.
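As a rough sketch of such a check, the function below takes the list of file paths changed by a pull request and flags the request when it touches more than one top-level folder. It assumes a repository design where each top-level folder maps to one class of devices, and that the list of paths comes from elsewhere (for instance, the output of `git diff --name-only`). The folder names are made up, and posting the result to the Git provider is left out because that call differs for each provider.

```python
from pathlib import PurePosixPath


def touched_classes(changed_paths):
    """Return the top-level folders (one per class of devices) touched by a change."""
    classes = set()
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        if len(parts) > 1:  # files at the repository root belong to no class
            classes.add(parts[0])
    return classes


def size_check(changed_paths, max_classes=1):
    """Return a (state, description) pair suitable for a commit status message."""
    classes = sorted(touched_classes(changed_paths))
    if len(classes) > max_classes:
        description = "Touches {} classes of devices: {}".format(
            len(classes), ", ".join(classes)
        )
        return "failure", description
    return "success", "Touches at most {} class(es) of devices".format(max_classes)


# Hypothetical list of files changed by a pull request.
changed = ["routers/edge-01.yaml", "switches/core-02.yaml", "README.md"]
print(size_check(changed))
```

Whether a flagged pull request is blocked outright or merely requires an extra approval is a policy decision; the sketch only produces the verdict.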
Lesson #3: Beware deceptively small changes in pull requests
Changing a few lines on a blueprint may have an outsized impact on the final product in the real world. Likewise, a few characters in a GitOps repository may trigger massive shifts in the infrastructure, so we need to extend the concept of “small, independent changes” to the impact of a change.
Continuing the previous example of a change to a folder containing shared settings, imagine a configuration change that threatens compatibility, such as upgrading Kubernetes clusters to the recent 1.22 release. That release removed support for several long-standing beta APIs, and adopters received extensive warnings and education for several months about the potential disruptions to existing workloads.
Gradual rollouts. While the pull request with the version change to the “common” folder is technically “small,” it probably deserves a more deliberate rollout approach, updating only a few clusters in the earlier stages of the pipeline.
The exact strategy for “deliberate” may vary. For example, you may start with an initial commit to a couple of individual folders mapped to clusters in the “development” stage. For larger deployments and increased productivity, it is best to look into tooling that specializes in rollout strategies such as blue-green deployments, canaries, and feature flags.
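One way to picture a deliberate rollout is as an ordered list of waves, where the first wave holds only a couple of clusters from the earliest stage. The sketch below is an illustration under assumptions: the stage names, cluster names, and the `first_wave_size` parameter are hypothetical, and specialized rollout tooling would add health checks and pauses between waves.

```python
STAGE_ORDER = ["development", "test", "production"]


def rollout_waves(clusters_by_stage, first_wave_size=2):
    """Split a fleet into ordered waves, starting with a small initial wave."""
    waves = []
    for stage in STAGE_ORDER:
        names = sorted(clusters_by_stage.get(stage, []))
        if not names:
            continue
        if not waves:
            # The very first wave is kept deliberately small.
            waves.append(names[:first_wave_size])
            names = names[first_wave_size:]
        if names:
            waves.append(names)
    return waves


# Hypothetical fleet, for illustration only.
fleet = {
    "development": ["dev-a", "dev-b", "dev-c"],
    "test": ["test-a"],
    "production": ["prod-a", "prod-b"],
}
for number, wave in enumerate(rollout_waves(fleet), start=1):
    print(number, wave)
```

Each wave would map to its own commit against the individual cluster folders, with the change to the “common” folder landing only after the last wave succeeds.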
Lesson #4: Unit testing is integration testing
One of the cornerstones of continuous delivery is an extensive battery of automated tests. Good test automation covers the range of unit, integration, and system tests. With GitOps, unit testing is more challenging because the resource contents are often proxies to infrastructure components. The resources also lean towards declarative approaches using “mini languages,” such as Terraform files and the occasional DDL file for relational databases.
Testing “mini languages” in pull requests is a common dilemma in unit testing because the statements are entangled with specialized backend servers. Think relational databases, messaging servers, and other types of stateful layers. Attempting to simulate the work of these dependencies is often a losing and expensive proposition, so application developers tend to rely on mock objects and defer the remaining gap in test coverage to later stages of the CI/CD pipeline.
When it comes to GitOps, that coverage gap is much broader, even if you are cleverly mimicking the technique of using mock objects with something like cdk8s.
Closing that gap in the earliest pipeline stage is a challenging balancing act between speed and coverage. Leave too much testing out of the pull request validation, and problems escape long enough that fixing them requires rework; try to validate every scenario, and the pull requests start to pile up and collide.
Reconciliation is for speed. The effects of merging a pull request on the infrastructure tend to fall along the CRUD axis of creation, updates, and deletions.
In order of speed, from faster to slower, we typically see deletion, update, and creation.
When creating new resources starts to take disproportionally longer than the other operations, it makes sense to favor reconciliation tests against a standing environment, at least in the earlier stages of the pipeline.
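A reconciliation test against a standing environment only performs the operations needed to move from the current state to the desired one. As a minimal sketch, assuming both states can be represented as dictionaries keyed by resource name (the resource names and attributes are hypothetical), the plan can be derived like this:

```python
def plan_reconciliation(current: dict, desired: dict) -> dict:
    """Classify resources into the operations needed to reach the desired state."""
    create = sorted(desired.keys() - current.keys())
    delete = sorted(current.keys() - desired.keys())
    update = sorted(
        name
        for name in desired.keys() & current.keys()
        if desired[name] != current[name]
    )
    return {"create": create, "update": update, "delete": delete}


# Hypothetical standing environment and desired state after a pull request.
current = {
    "vm-a": {"size": "small"},
    "vm-b": {"size": "small"},
    "vm-c": {"size": "large"},
}
desired = {
    "vm-a": {"size": "small"},
    "vm-b": {"size": "medium"},
    "vm-d": {"size": "small"},
}
print(plan_reconciliation(current, desired))
```

In this example the unchanged resource is left alone, so the slow creation path runs for only one resource instead of three.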
Standing environments may also bring a financial cost to the balance between speed and coverage, a common gold-plated blind spot for many technical people. I am not saying you should be counting the dollars before issuing a pull request, but whoever designs the CI/CD pipeline for a GitOps practice absolutely should be making those calculations side-by-side with the technical decisions.
The safety of no reconciliation. In this approach, any change to the repository means deleting and recreating all resources corresponding to that change.
The approach has some merit, as it cuts down test scenarios considerably and ensures a more consistent overall system state due to components effectively resetting their internal state every time their configuration changes.
The drawback is that as the system scale increases to thousands of resources, one must be willing to become familiar with the underbelly of large-scale provisioning in most IaaS providers: API rate limits, network traffic jams due to terabytes of images floating around the private network, bootstrap bugs that only rear their heads when the network is congested, and many others.
Lesson #5: There is no rolling back
This lesson assumes the GitOps repository is versioned with branches or tags, as outlined in the previous article in this series.
In a multi-staged CI/CD pipeline, you have parallel environments such as “development,” “test,” and “production.” A pull request may pass validation in the early stages of the pipeline and still fail in production.
When a production environment has a problem, it is natural to expect people to call for a rollback, especially from outside the DevOps team. Rolling back makes a lot of sense in the abstract: if the changes broke things, then bring back the old configuration.
In the real world, however, a GitOps repository contains a mixture of imperative (scripted) and declarative statements, which means a previous commit does not ensure the system state will revert to what it was before the attempt to move forward. Still, even if referencing that old commit somehow brings the system back to life, you now have a disrupted pipeline flow and two unsavory choices:
- Create a new patch. Submit a change request for whatever afflicted the production system and roll it through the pipeline. All earlier stages will go from commit “N” to commit “N+1,” and the production stage will eventually make a long jump from commit “N-1” to commit “N+1”. That means the production system will have gone through different internal states than other stages, potentially increasing the differences between the stages.
- Reverse the pipeline. Roll back the changes on all other pipeline stages to match the rolled back state of the production environment. I consider this approach unthinkable as a standing practice because it is always possible that a completely unrelated pull request has already started its way through the pipeline. Even if one successfully reverses the entire pipeline back to a known baseline, there is still the unresolved matter of dealing with previously merged requests after ejecting them from the pipeline.
Given enough brainpower in studying the failing pull request and wizard-like ability with the Git command line, a rollback is often possible. Still, it is not scalable and rarely cost-effective, so it is crucial to keep a couple of design principles in mind to minimize those occurrences:
- Small and independent. I mentioned it before, but it is worth repeating. Keeping pull requests small in size and in their impact on the infrastructure makes it easier to entertain a new pull request that undoes the breaking change than to consider a rollback across pipeline stages.
- One pull request at a time. Assuming the volume of changes allows for it, implement a policy of allowing a single pull request to roll across the pipeline. The pipeline can queue the processing of another pull request until it promotes the current pull request to production.
- Treat rollbacks as exceptions. Unless the infrastructure supports reliable downgrades to the previous version, do not entertain rollbacks as a regular, automated process. If the infrastructure has complex internal states or, worse, databases, rollbacks rarely bring the system back to health.
Ultimately, it may be impossible to completely rule out the chance of a rollback in a production environment, but following these tips can minimize the occurrence and impact of rollbacks.
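The “one pull request at a time” policy can be sketched as a small queue that lets a single pull request travel through the stages while the others wait. The class name, stage names, and pull request identifiers below are hypothetical, and the sketch assumes `promote` is called only after validation passes at the current stage.

```python
from collections import deque

STAGES = ("development", "test", "production")


class PromotionQueue:
    """Allow a single pull request at a time to roll across the pipeline."""

    def __init__(self):
        self._waiting = deque()
        self.current = None
        self.stage_index = None

    def submit(self, pr_id):
        """Queue a pull request; it starts right away if the pipeline is free."""
        self._waiting.append(pr_id)
        self._start_next()

    def _start_next(self):
        if self.current is None and self._waiting:
            self.current = self._waiting.popleft()
            self.stage_index = 0

    def promote(self):
        """Move the current pull request one stage forward.

        Promoting past the last stage frees the pipeline for the next request.
        """
        if self.current is None:
            raise RuntimeError("no pull request in flight")
        if self.stage_index < len(STAGES) - 1:
            self.stage_index += 1
        else:
            self.current = None
            self.stage_index = None
            self._start_next()

    def status(self):
        """Return (pull request, stage) for the request in flight, or None."""
        if self.current is None:
            return None
        return self.current, STAGES[self.stage_index]


queue = PromotionQueue()
queue.submit("pr-101")
queue.submit("pr-102")  # waits until pr-101 clears production
for _ in STAGES:
    queue.promote()
print(queue.status())
```

The trade-off is throughput: the queue keeps every stage on the same commit sequence, at the cost of making later pull requests wait for the one in flight.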
CI/CD practices for application development are well-suited for GitOps repositories, but the pipelines produce very different artifacts from applications.
GitOps repositories contain blueprints that remote agents can follow while building or reconfiguring infrastructure. Even small changes can have widespread ramifications, so it is essential for pipeline designers to carefully weigh the goals for developer productivity against the costs to the organization.
For developers, GitOps requires renewed attention to the CI/CD principles of keeping changes small in size, realizing that “small” also extends to anticipating the impact of changes in the underlying infrastructure and adjusting pull requests to reduce that impact where possible.
These lessons helped me minimize the number of manual interventions to our GitOps pipeline and, more importantly, smooth out the flow of activities and the amount of rework around them. While some of the lessons may be more of a journey than concrete recipes, I think they are universal enough to help practitioners at various stages of adoption.
- Continuous integration vs. continuous delivery vs. continuous deployment
- Infrastructure as code: A non-boring guide for the clueless
- The GitOps Files — Repository design
- A practical guide to the continuous integration/continuous delivery (CI/CD) pipeline
- Argo Rollouts, the Kubernetes Progressive Delivery Controller, Reaches 1.0 Milestone
- GitHub Status API