Design strategies for GitOps repositories
(This story is the foundation of my other story, designing CI/CD pipelines for GitOps.)
Spend a few hours in your newly established GitOps practice, and you soon realize you need to decide how to represent your infrastructure as folders.
On the one hand, infrastructure is rich in concepts and relationships, spread over physical and logical entities. On the other hand, Git repositories have only a few building blocks to represent that knowledge: branches, tags, a folder tree, and files inside the folders.
Now you have a classic mapping problem. Should folders be organized by device class? By pipeline stage? Maybe dedicate a folder to each specific device? How about relationships?
This article lists the core lessons I learned while establishing a GitOps practice for deploying and managing Kubernetes (OpenShift) clusters running software stacks with thousands of containers.
Lesson #1: Beware the topology
Do not spend too much time developing an elaborate folder structure resembling the structure of the deployment environment. Anything over a certain depth makes navigation cumbersome, documentation more complicated, and explanations longer.
My magic number is three levels. Your number may vary, but the challenge past that number is that folder trees are relatively narrow and rigid structures without the means to add attributes to folders or relationships. While one may assign semantics to levels in the tree, those semantics are not visible in a file explorer and need to be written down somewhere else.
As a result, “tall” trees not only produce dissonant changes of theme across folder levels, but their external semantics also do not hold up well as the system evolves. For instance, take the somewhat long navigation path below as a concrete representation of how folders are organized inside a Git repository:
Such implicit folder structure may feel like a sensible containment tree, but what if you need a new Kubernetes cluster deployed with redundant workers across multiple regions? Now you can no longer respect the semantics for the second-level folders because they represent a single region containing the cluster workers.
At that point, you have two options:
1. Add a “multi-region” folder under “production.” The second level of the tree would then mix specific regions of the infrastructure provider with, disconcertingly, a logical grouping that matches no particular region.
2. Struggle with the cognitive dissonance of option 1 for a few minutes, then consider removing the concept of regions from the folder structure, first ensuring that there are no region-specific settings for the clusters. Once you finally remove regions from the tree, you start wondering why you added them in the first place.
The insidious aspect of leaning towards topology mappings is that they initially feel right, even when not strictly necessary or valuable. Therefore, it is important to question every new extra layer proposed to the folder structure, asking yourself whether the new layer will inform considerations and decisions in the workflows around the repository.
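One way to keep that question in front of the team is to flag deep paths automatically. What follows is only a minimal Python sketch, not a tool I am prescribing: the limit of three comes from my own magic number, and the file listing is hypothetical (in practice it could come from the output of `git ls-files`).

```python
from pathlib import PurePosixPath

MAX_DEPTH = 3  # assumption: the "magic number" from this lesson


def folder_depth(path: str) -> int:
    """Number of folder levels above the file in a repository-relative path."""
    return len(PurePosixPath(path).parts) - 1


def too_deep(paths: list[str], max_depth: int = MAX_DEPTH) -> list[str]:
    """Return the paths whose folders nest deeper than max_depth."""
    return [p for p in paths if folder_depth(p) > max_depth]


# Hypothetical file listing, e.g. the output of `git ls-files`.
tracked = [
    "clusters/production/values.yaml",
    "clusters/production/us-south/workers/pool-a/values.yaml",
]
print(too_deep(tracked))
# prints ['clusters/production/us-south/workers/pool-a/values.yaml']
```

A check like this does not decide anything on its own. It only prompts the conversation about whether the extra level informs any decision in the workflows around the repository.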
Lesson #2: Versioning with branches and tags
All changes must eventually converge to the main branch, but iterations should be versioned using branches and tags in the repository.
Branches and tags are the only extra structural dimension for the tree-based organization of the folders in a Git repository. I use the word “structural” because one can create new dimensions through external conventions, which incurs many of the pitfalls listed in the previous section.
Git branches and tags are essential to support parallel deployments of the same repository across different deployment zones. That ability to support parallel deployments is needed when validating repository changes across different stages of a deployment pipeline.
As you work with branches and tags, it is often beneficial to adopt semantic versioning (https://semver.org/) as the naming strategy. Semantic versioning rules are sensible and well-documented, making it easier for users of a repository to identify more recent versions and quantify the differences between branches as major, minor, or simple patches.
I have seen teams implement simplified versions of semantic versioning without the “major” field, reasoning that “major” entails backward-incompatible changes and that there should never be such a thing in a continuous deployment pipeline.
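To make the benefit concrete, here is a small Python sketch of how a script might pick the most recent version and quantify the difference between two versions. It assumes names shaped like “v1.4.2” or “1.4.2” and deliberately ignores the pre-release and build-metadata parts of the full semantic versioning grammar; the branch names in the example are made up.

```python
import re
from typing import NamedTuple, Optional


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


# Assumption: names look like "v1.4.2" or "1.4.2"; pre-release and build
# metadata from the full semver.org grammar are deliberately ignored here.
_PATTERN = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse(name: str) -> Optional[Version]:
    """Parse a branch or tag name, returning None for non-version names."""
    match = _PATTERN.match(name)
    return Version(*map(int, match.groups())) if match else None


def change_kind(old: Version, new: Version) -> str:
    """Classify the difference between two versions."""
    if new.major != old.major:
        return "major"
    if new.minor != old.minor:
        return "minor"
    if new.patch != old.patch:
        return "patch"
    return "none"


# Hypothetical branch names; "feature-1" is skipped because it is not a version.
names = ["v1.9.0", "v1.10.1", "feature-1", "v1.10.0"]
versions = sorted(v for v in map(parse, names) if v is not None)
print(versions[-1])                             # Version(major=1, minor=10, patch=1)
print(change_kind(versions[0], versions[-1]))   # minor
```

Comparing the fields as numbers matters: a plain text sort would place “v1.9.0” after “v1.10.1”.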
Feature branches are OK. I am excepting short-lived “feature” branches from this lesson. Their use is undisputed since they are created temporarily to support reviews and validation of pull requests. Merged branches add up quickly and become a source of clutter and confusion, so I recommend turning on the respective Git provider setting to delete branches immediately after they are merged into the target branch.
Lesson #3: Isolate environment-specific settings
Not all differences are created equal. A core design principle for a continuous deployment pipeline is minimizing differences between stages because fewer differences increase the chances that a change works in the next pipeline stage.
Before discussing solutions, let’s qualify those differences:
- Intrinsic differences. These are long-term structural differences between different environments, such as a VLAN identifier, different numbers of VMs, or different cloud regions.
- Versioning differences. These are the expected transient differences while changes are validated in the previous pipeline stage, such as image URLs for the containers used in the product or versions of micro-services comprising the whole product.
Intrinsic differences should reside in files whose names mark them as specific to an environment, such as “values-production.yaml” or “values-us-south.yaml”.
Versioning differences should not go in files named or otherwise marked as specific to an environment. There may be rare exceptions to that rule, but as Ford Prefect once said: “Even if you prove it to me, I won’t believe it.”
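The separation can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a description of any particular tool: plain dictionaries stand in for the contents of files such as “values-production.yaml”, and the image URL, the keys, and the list of “versioned” keys are all hypothetical.

```python
# Shared settings: versioning differences live here and travel with the branch.
BASE = {
    "image": "registry.example.com/shop/cart:1.4.2",  # hypothetical image URL
    "replicas": 2,
}

# Stand-ins for the contents of "values-production.yaml" and a dev equivalent:
# only intrinsic, long-term differences belong in these files.
ENVIRONMENTS = {
    "production": {"replicas": 6, "vlan_id": 120, "region": "us-south"},
    "dev": {"vlan_id": 310, "region": "us-south"},
}

# Assumption: keys the team has agreed are versioning differences.
VERSIONED_KEYS = {"image"}


def effective_values(env: str) -> dict:
    """Merge shared values with the overrides for one environment."""
    overrides = ENVIRONMENTS[env]
    leaked = VERSIONED_KEYS.intersection(overrides)
    if leaked:
        raise ValueError(f"{env} overrides versioned keys: {sorted(leaked)}")
    return {**BASE, **overrides}


print(effective_values("production"))
```

The guard in `effective_values` is the interesting part: an environment file that tries to pin its own image version fails loudly instead of silently drifting away from the other stages.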
With that combination of folder structure, file contents, and branching, let’s walk through the two typical workflows: one that should affect all deployments and one that should only affect a single environment:
Modifying all deployments
- The genesis. The developer clones the repository and checks out the “main” branch. Usually, this should be the default branch for the repository.
- The developer creates a new branch named “feature-1”, makes local changes to the cloned repository, commits the changes, and pushes the new “feature-1” branch to the remote Git repository at the Git Provider.
- The developer submits a pull request for “feature-1” into the “main” branch. (For GitLab users, that would be a merge request.)
- Trying times ahead. The Git provider issues an event about the pull request to an automated process — owned by the repository owners — responsible for applying the change to the “dev” environment.
- The automated process applies the changes in “feature-1” to the development environment and starts the validation tests.
- A new branch is born. Assuming the validation tests pass, the new “feature-1” is considered good enough to be merged back into “main,” and that new version of “main” with the changes becomes the basis of a new versioned branch. Note: An automated process cannot guess the correct name for the new branch in a semantic versioning scheme, so it is helpful to allow the author to add that type of clue in the text of the pull request.
- Failure is an option. It is still possible that validation in the next pipeline stages uncovers a problem with the new branch resulting in it being sidelined as failed, stopping its progression across the pipeline.
- Retreating is not an option. While there are proponents of a rollback strategy for the stages containing a failed branch, accept that rolling forward is the better approach (a topic I covered in a separate article about CI/CD pipelines for GitOps). In the “roll forward” strategy, a new issue is automatically created, and the team works on a new patch that addresses the issue. Fixing this situation takes precedence over any other modification, and the pipeline is closed to further changes.
- If it doesn’t fail, keep it going. With the pull request merged into “main,” that request is closed automatically, and an automated process starts to deploy the changes to the next stage of the pipeline.
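The note about leaving a clue in the pull request text can be sketched as follows. This is a minimal Python illustration, not how any Git provider works: the “bump:” keyword is a hypothetical team convention, the default of “patch” is an assumption, and `validate` is a stand-in for the validation tests in the “dev” environment.

```python
import re

# Hypothetical convention: the author adds a line such as "bump: minor" to the
# pull request text. Nothing in Git or any Git provider defines this keyword.
_HINT = re.compile(r"^bump:\s*(major|minor|patch)\s*$", re.IGNORECASE | re.MULTILINE)


def next_version(current: tuple[int, int, int], pr_text: str) -> tuple[int, int, int]:
    """Compute the next version from the hint in the pull request text."""
    match = _HINT.search(pr_text)
    kind = match.group(1).lower() if match else "patch"  # assumed default
    major, minor, patch = current
    if kind == "major":
        return (major + 1, 0, 0)
    if kind == "minor":
        return (major, minor + 1, 0)
    return (major, minor, patch + 1)


def on_pull_request(current, pr_text, validate) -> str:
    """Outcome of one pull request; `validate` stands in for the dev-stage tests."""
    if not validate():
        return "rejected: pull request stays open"
    version = next_version(current, pr_text)
    return "merged: create branch v" + ".".join(map(str, version))


text = "Raise worker memory limits.\n\nbump: minor"
print(on_pull_request((1, 4, 2), text, validate=lambda: True))
# prints "merged: create branch v1.5.0"
```

In a real pipeline, the returned string would be replaced by calls to the Git provider to merge the request and create the branch. Those calls are left out here because they differ per provider.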
Modifying a single environment
If environment-specific folders and files are isolated to small corners of the repository, this scenario is virtually identical to the modification of all environments.
There is room for some optimizations, like skipping slower or expensive tests in environments other than the one being modified. However, one must carefully manage the costs of developing these unique paths and weigh them against the volume of changes to single environments.
No branch per environment. I must put a strong emphasis on the word “versioning” in the title of this lesson. “Beware the topology” still applies, and trying to create a branch per deployment environment is riddled with pitfalls. This article covers the main problems in great detail, so I will not revisit them here.
Lesson #4: Beware the Monorepo
“Monorepo” is a strategy where everything, such as applications, services, and infrastructure, is represented in a single Git repository.
I consider this approach useful for demonstrations and class material, where the content is managed by a few people and applied to non-production, short-lived environments.
For production environments, a GitOps practice is likely using some form of managed or self-hosted Git provider, such as GitHub, GitLab, or Bitbucket. (Note that GitLab is the only one of the three to open-source its core code, while all three offer self-hosted editions.) These Git providers have a few architectural constraints limiting the usage of monorepos.
Managed Git providers attach extra configuration, eventing, workflows, and access control to Git repositories. As your organization grows, disciplines like application development and infrastructure tend to become more specialized, with that specialization manifesting itself in at least two relevant dimensions:
Workflows. Git providers support automated handlers for events in the repository, such as creating new branches, creating new commits, or requesting a code review.
Handling the creation of a new commit on an “infrastructure” folder is likely to require significantly different validation strategies when compared to handling a change to the “apps” folder. When all folders are in the same repository, every change to any of the folders generates a new event, leaving it up to the automated handler to reroute the event.
Writing and maintaining routing event mechanisms can be fun in their own right but can also become avoidable distractions in your GitOps practice.
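To give a sense of what that routing looks like, here is a deliberately small Python sketch. The folder names, the handler names, and the list of changed files are all hypothetical; a real handler would receive the changed paths from the Git provider's event payload.

```python
from collections import defaultdict

# Hypothetical top-level folders and handler names for a monorepo.
ROUTES = {
    "infrastructure": "validate-infrastructure",
    "apps": "validate-apps",
}


def route(changed_paths: list[str]) -> dict[str, list[str]]:
    """Group changed files by the handler responsible for their top folder."""
    handlers = defaultdict(list)
    for path in changed_paths:
        top = path.split("/", 1)[0]
        handlers[ROUTES.get(top, "unrouted")].append(path)
    return dict(handlers)


# One commit touching both areas fans out to two validation strategies.
print(route(["apps/cart/values.yaml", "infrastructure/lb/pool.yaml", "README.md"]))
```

Even this toy version already raises questions, such as what to do with the “unrouted” files and how to report a combined result when one handler passes and the other fails. Separate repositories sidestep those questions.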
Work management and access control. With different workflows come specialized sets of people. Someone who understands the releases of the shopping cart application may not necessarily be the same person who can vet a merge request to assign virtual machines to a load balancer.
Git providers generally grant read access per repository rather than per folder, file, or label, so organization members end up having access to everything in that repository. Branch protection rules can restrict who writes to a branch, but they do not hide content. Update on 4/16: Joe Bowbeer noted in the comments section that GitHub’s “code owners” feature could ensure a minimum set of reviewers based on file names in the pull request.
Skipping past potential security concerns around Sealed Secrets, repository members may still be distracted by extraneous churn and notifications. Over time, as the volume of uninteresting notifications increases, you start to notice some undesirable outcomes: either people muscle through the noise, or they turn off the notifications. Muscling through the noise incurs individual productivity penalties, while turning off notifications slows down the flow of activities.
Lesson #5: The workflow rule. Workflows rule.
This lesson extends the previous one: if monorepos have limited application, it follows that one should always have multiple repos, but which ones? While we examine this lesson, it is important to note that an excessive proliferation of repositories introduces new costs of its own: managing all these repositories and keeping tabs on access control lists, webhooks, access tokens, and many other settings.
Even as Git providers support some consolidation of those settings in a parent “organization” containing multiple projects, that means sidelining people and resources to own the shared responsibility for those settings.
Lesson #1 shows how topology or containment mappings may not be the best organizing principles, so it is time to let folder structures take a back seat and remember that Git providers organize workflows around repositories.
The rule of thumb to define boundaries between repositories is to look at the Git provider settings and find the overlaps between people and automated processes. If people and processes overlap entirely, then a single repository is likely sufficient for that team.
If people or processes differ, it is time to consider another repository, possibly under the same Git provider “organization.” An “organization” works as a shared container for multiple repositories. It is a good place to share settings across repositories, such as a well-designed continuous integration pipeline for building containers, configuration linters, etc.
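The rule of thumb can be expressed as a small grouping exercise. The Python sketch below is only an illustration under simplifying assumptions: the component names, people, and process names are invented, and it treats “overlap entirely” as exact equality of both sets, whereas real teams will need judgment for partial overlaps.

```python
from collections import defaultdict

# Hypothetical components with the people and automated processes behind each.
COMPONENTS = {
    "cart-app": ({"ana", "raj"}, {"app-tests"}),
    "checkout-app": ({"ana", "raj"}, {"app-tests"}),
    "load-balancers": ({"kim"}, {"infra-plan", "infra-apply"}),
}


def propose_repositories(components: dict) -> list[list[str]]:
    """Group components whose people and processes overlap entirely."""
    groups = defaultdict(list)
    for name, (people, processes) in components.items():
        groups[(frozenset(people), frozenset(processes))].append(name)
    return [sorted(names) for names in groups.values()]


print(propose_repositories(COMPONENTS))
# prints [['cart-app', 'checkout-app'], ['load-balancers']]
```

Here the two applications share reviewers and validation, so they land in one repository, while the load balancer settings get their own.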
Working with deployment environments and Git repositories is a constant exercise in mapping complex topologies to simpler structures. While a deployment structure may seem like a tempting starting point due to the team’s familiarity, topologies tend to be a misleading organizing principle.
Many mapping rules may be costly to establish, only to fail a moment later when you try to represent new topology nuances. Organizing repositories around workflows and people keeps the mapping exercises focused and reduces friction between day-to-day activities and folders.
As the practice grows, use branches and tags to keep your deployment pipelines moving forward. As processes mature, promote common automated workflows to the organization level while keeping repositories isolated to minimize distractions in manual flows (e.g., reviews and approvals).
I hope these lessons are helpful in new and established deployments, and I would like to read your comments (and links) on the topic. Now, let’s deploy some systems.