blakehawkins.com

Unexpected Continuous Deployment Characteristics

Posted 2020-11-25

Years ago, continuous deployment was a contractor buzzword. The value proposition behind the hype was optimising the feedback loop between shipping a product change and users seeing it. Devops engineer became a job title, and even the most apprehensive owners of legacy systems found success with more frequent product releases.

Overall, though, continuous deployment is a pretty simple concept. At least for modern software, getting the basics working is straightforward.

In 2020, it comes as no surprise if I tell you that my employer's deployment model utilises blue/green deployment, with software-defined canaries based on risk tolerance, and automatic rollback using health checks and monitoring.
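
For illustration only, here's roughly what that promotion loop looks like as a sketch. The stack objects, their methods, and the health-check function are hypothetical - this is not our actual infrastructure:

    # A minimal sketch, assuming hypothetical "stack" objects that expose
    # traffic and lifecycle operations; none of these names are real.
    import time

    def promote_or_rollback(green, blue, check_health, soak_seconds=300) -> str:
        """Shift traffic to the green stack, watching health for a soak period.

        `check_health(stack)` is assumed to return True while the stack's
        health checks and monitoring look good.
        """
        green.receive_traffic()                  # canary portion first, then all
        deadline = time.time() + soak_seconds
        while time.time() < deadline:
            if not check_health(green):
                green.stop_traffic()             # automatic rollback
                blue.receive_traffic()
                return "rolled back"
            time.sleep(10)
        blue.decommission()                      # green becomes the stable stack
        return "promoted"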

This post is about the more unusual deployment characteristics: things we have that might surprise you, and also ideas that don't yet exist at all.

Galaxies

A very basic CD infrastructure subscribes to product releases, and deploys them to the production environment. Surprisingly, I know of at least one huge tech company that doesn't do more than just this for its CD.

Thinking about what production environment means prompts us to consider an alternative environment: typically a staging environment, where developers deploy - for example - test releases for validation.

If we have continuous deployment into both a production and staging environment, we've made a shallow first step into a concept I've seen called galaxies. For the second step, I recommend reading A sky full of clouds, which describes how continuous deployment can work when there are 100+ different production environments (say, one for each software customer).

In this second step, the services that comprise the deployment infrastructure can themselves be deployed anywhere. The only thing that's needed is for some infrastructure controller to be routable from each production environment.

So the third step, multiple galaxies, is having multiple infrastructure controllers. The simplest example would be one global controller running in AWS, and one private controller that's only used within BigCompany on an air-gapped local network.
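
To make that concrete, here's a minimal sketch of the data model that multiple galaxies imply. All of the names are illustrative, not taken from any real system:

    # Illustrative data model for "galaxies": multiple infrastructure
    # controllers, each responsible for a set of deployment environments.
    from dataclasses import dataclass, field

    @dataclass
    class Environment:
        name: str              # e.g. "prod-eu", "staging", "customer-42"
        purpose: str           # "production" or "staging"

    @dataclass
    class Controller:
        name: str                                        # e.g. "global-aws"
        reachable_from: list[str] = field(default_factory=list)
        environments: list[Environment] = field(default_factory=list)

    # One global controller, plus one that only exists inside an air-gapped network.
    galaxies = [
        Controller("global-aws", reachable_from=["internet"],
                   environments=[Environment("prod", "production"),
                                 Environment("staging", "staging")]),
        Controller("bigcompany-airgapped", reachable_from=["bigcompany-lan"],
                   environments=[Environment("bigcompany-prod", "production")]),
    ]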

A sidebar on air-gapped networks

Is this a blog post within a blog post?

The intent here is to help justify why multiple galaxies are a real thing that we must all consider for our deployment infrastructure, but I also just have some cool stories. For context, supporting air-gapped networks has been a critical, non-negotiable feature multiple times in my life.

The first time I experienced air-gapped networks was in 2014. The specific use case was like something out of a movie. The network consisted of multiple, simple clusters of a load balancer in front of two services. Normally, these load balancers are configured statically, peering to foreign networks. In this particular instance, though, we observed network behaviour where the load balancers would lose visibility of one another, temporarily failing health checks, but often reconnecting. When we inquired about network conditions, the customer responded that the load balancers were deployed on tanks.

If you can recall as far back as 2014, something else was released: Kubernetes. Along with Kubernetes' popularity came a whole shitstorm of other "critical" services that everyone now has to understand - things like the "service mesh".

Anyway, in a hackathon a year or two later, I hacked on a project dreamed up by the brilliant Richard Lupton. At the time, the two of us were working on a loadbalancer ourselves as our day job. And the hack idea was to deploy our loadbalancer as a side-car in a kubernetes cluster. Yes, it worked, and the astute reader may recognise that this isn't so dissimilar to the tanks, and also not so dissimilar to what a service mesh does.

To close down the digression: loadbalancer, Kubernetes, service mesh, or not - air-gapped networks are real (and cool), so don't ignore them. Your deployment infrastructure should probably take into account that disparate deployment controllers might eventually be a requirement.

FIFO workloads

I work on a product that schedules compute workloads (Spark jobs) into a deployment environment. I also happen to not work on the team that owns either the deployment environment or the continuous deployment infrastructure. You may infer the lesson already: temporary workloads, at comparatively huge volume, are otherwise not so different from normal deployed services. There is still a desire to upgrade them. There is still a desire to blue/green them. There is still a desire to get feedback on health checks.

That's not to say that such workloads don't have unique challenges. Spark wants to scale resources horizontally on a job-by-job basis. So, minimally, you need a way to statically allocate FIFO jobs as groups of services. You also need to consider that the code running in each compute job is potentially owned by someone different: treating all the jobs as one kind of service is helpful in some contexts, but damning in others.
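
A sketch (with made-up names) of what "FIFO jobs as groups of services, with owners" might look like:

    # Sketch: treating a batch of Spark jobs as a statically allocated group
    # of services, while tracking who owns the code inside each job.
    from dataclasses import dataclass

    @dataclass
    class ComputeJob:
        job_id: str
        owner: str             # team that owns the code running in the job
        spark_version: str

    @dataclass
    class JobGroup:
        """A FIFO queue of jobs, scheduled as one logical service group."""
        name: str
        jobs: list[ComputeJob]

        def by_owner(self) -> dict[str, list[ComputeJob]]:
            # Useful when "one kind of service" is the wrong abstraction:
            # e.g. notifying owners, or rolling back only one owner's jobs.
            grouped: dict[str, list[ComputeJob]] = {}
            for job in self.jobs:
                grouped.setdefault(job.owner, []).append(job)
            return grouped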

As a specific example: consider what issues present themselves if one workload successfully upgrades Spark and another workload fails to. How do you decide when to globally roll back? How many job failures are enough to consider the new version broken? If a version is declared broken, should that automatically apply across the fleet? Is it appropriate to quietly downgrade the jobs that may by this point have succeeded hundreds of times?
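
There's no single right answer, but whatever you pick should be an explicit policy. A sketch, with completely arbitrary thresholds and hypothetical job attributes:

    # Sketch of a fleet-wide "is this Spark version broken?" decision.
    # The thresholds are arbitrary and the policy is just one possible answer.
    def version_is_broken(results: list[bool],
                          min_samples: int = 20,
                          max_failure_rate: float = 0.05) -> bool:
        """`results` holds one True/False per job run on the new version."""
        if len(results) < min_samples:
            return False                      # not enough evidence yet
        failures = results.count(False)
        return failures / len(results) > max_failure_rate

    def plan_rollback(jobs_on_new_version, broken: bool):
        # Deliberately does NOT downgrade jobs that have already succeeded
        # on the new version; that decision is left to each job's owner.
        # `successful_runs` is a hypothetical attribute on the job objects.
        if not broken:
            return []
        return [job for job in jobs_on_new_version if job.successful_runs == 0]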

HA state/IO duplication

If you own an application that successfully and reliably uses a highly available distributed data store: great job.

There are some workloads, however, that just cannot do that. Suppose you own a telescope that measures cosmic background radiation. Your device is sampling megabytes of data at high frequency - read: very high bandwidth. It's impossible to store the whole stream of data, but it's possible to do lots of other things, like:

  • Store a buffer
  • Drop data (sampling)
  • Control downstream read rates (in a processing graph)
  • Further buffer downstream
  • Re-read from upstream buffers
  • Duplicate processing I/O (multi-producer, multi-consumer queues)
  • High frequency processing on a local buffer

TLDR: stream processing.
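
For a flavour of a few of those techniques, here's a toy sketch of a bounded local buffer that samples (drops) data under pressure and lets downstream control its own read rate. Purely illustrative:

    # Toy sketch: a bounded local buffer, plus dropping (sampling) data
    # when the buffer is full, plus a downstream-controlled read rate.
    from collections import deque

    class SamplingBuffer:
        def __init__(self, capacity: int, keep_every_nth_when_full: int = 10):
            self.buffer = deque(maxlen=capacity)
            self.keep_every_nth = keep_every_nth_when_full
            self.dropped = 0
            self.seen = 0

        def push(self, sample) -> None:
            self.seen += 1
            if len(self.buffer) == self.buffer.maxlen:
                # Under pressure: keep only a fraction of incoming samples.
                if self.seen % self.keep_every_nth != 0:
                    self.dropped += 1
                    return
            self.buffer.append(sample)

        def drain(self, n: int):
            # Downstream controls its own read rate by choosing n.
            return [self.buffer.popleft() for _ in range(min(n, len(self.buffer)))]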

To simplify, supporting stream processing in continuous deployment requires another kind of workload scheduling, not dissimilar to the FIFO workloads: it must be possible to define pairs of 1+1 HA services that duplicate I/O in a processing graph for streaming computation. This is needed to enable blue/green upgrades without losing in-memory data.

It's not fully clear to me what exactly is needed from continuous deployment to reliably support this. If the streaming processing graphs are statically defined, then the CD for them can be statically defined too - each vertex in the processing graph becomes its own service. If stakeholders are dynamically defining, manipulating, and modifying the streams, graphs, and computation definitions, then you have a deployment problem which is rather more difficult than the FIFO scheduling.
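
Just to illustrate the 1+1 idea itself (and not how any real CD infrastructure implements it), the hypothetical processor objects here are assumed to expose process and emit operations:

    # Illustration of a 1+1 HA pair: the same input stream is duplicated to
    # both members, so either can be restarted/upgraded without losing the
    # in-memory state the other has built up.
    class OnePlusOnePair:
        def __init__(self, blue_processor, green_processor):
            self.members = {"blue": blue_processor, "green": green_processor}
            self.active = "blue"            # whose output downstream consumes

        def push(self, event) -> None:
            # Duplicate I/O: every member sees every event.
            for member in self.members.values():
                member.process(event)

        def output(self):
            return self.members[self.active].emit()

        def switch_active(self) -> None:
            # Blue/green upgrade: repoint downstream at the already-warm standby.
            self.active = "green" if self.active == "blue" else "blue"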

Canary categories

Configuration is evil.

When deploying software, you really want all your users to use the software with the same settings in the same way. Clearly that's not how the real world works, so for the few cases where there are critical differences between some users and others, you want canaries that do a good job of capturing the differences in configuration.

As an example, suppose your customers either use AWS or Google Cloud for your service's backend. For the front-end, they either use Chrome or Firefox. You should select a canary in each combination of backend and frontend.

If your fleet is big enough, you probably also need to consider multiple "layers" of canaries (i.e. a third release track). Your highest risk-tolerance stacks are on the YOLO release track. At least one of your "the backend is running on tanks" deployments might be on a lower risk-tolerance release track - but potentially still a canary. You want to find out whether your releases work for tanks long before a release rolls out to the entire AWS and Google Cloud user base, yet clearly rollbacks in that environment are higher cost than on the high-tolerance YOLO canaries. You need multiple layers.
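
A sketch of what canary selection across configuration combinations and release tracks might look like. The mapping from backend to risk tolerance is made up, and I've thrown tanks in as a third backend:

    # Sketch: pick at least one canary per configuration combination, and
    # assign each to a release track by risk tolerance. Purely illustrative.
    import itertools

    BACKENDS = ["aws", "gcloud", "tanks"]
    FRONTENDS = ["chrome", "firefox"]

    def release_track(backend: str) -> str:
        # Hypothetical mapping of configuration -> risk tolerance.
        if backend == "tanks":
            return "cautious-canary"   # still a canary, but rollbacks are costly
        return "yolo-canary"

    def choose_canaries(deployments):
        """`deployments` is a list of dicts with 'backend' and 'frontend' keys."""
        canaries = {}
        for backend, frontend in itertools.product(BACKENDS, FRONTENDS):
            match = next((d for d in deployments
                          if d["backend"] == backend and d["frontend"] == frontend),
                         None)
            if match is not None:
                canaries[(backend, frontend)] = (match, release_track(backend))
        return canaries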

Network segmentation and ingress/egress policies

If you want services within a logical deployment to have strong network isolation - say at layer 3 - you'll need to think about managing temporary virtual networks as part of your deployments. In the extreme case, you may even need a separate network for each service, with explicit peering policies with respect to each other service. This, too, is a deployability problem.
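
A minimal sketch of per-service networks with explicit, default-deny peering policies. The service names and CIDRs are invented:

    # Sketch of per-service virtual networks with explicit peering policies,
    # created and destroyed as part of a deployment.
    from dataclasses import dataclass, field

    @dataclass
    class ServiceNetwork:
        service: str
        cidr: str
        allowed_peers: set[str] = field(default_factory=set)  # explicit ingress/egress

    def may_connect(networks: dict[str, ServiceNetwork], src: str, dst: str) -> bool:
        # Default deny: traffic is only allowed if both sides agree to peer.
        return (dst in networks[src].allowed_peers
                and src in networks[dst].allowed_peers)

    networks = {
        "frontend": ServiceNetwork("frontend", "10.0.1.0/24", {"api"}),
        "api":      ServiceNetwork("api", "10.0.2.0/24", {"frontend", "db"}),
        "db":       ServiceNetwork("db", "10.0.3.0/24", {"api"}),
    }
    assert may_connect(networks, "frontend", "api")
    assert not may_connect(networks, "frontend", "db")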

Service lifetime

In the "cattle, not pets" service infrastructure model, dealing with kernel upgrades and service auto-scaling is "easy": just kill instances and create new ones.

There are cases, however, where you want something more than just scaling down or rolling-restarting a service. Suppose you're AWS, and you offer private instances of a secretbook service. A user provisions a secretbook instance, which is read one page at a time and can only be read once. Logically, the book is tied to the provisioning user's identity, and the book's lifetime ends with its last page.

This was a contrived example. A more practical one comes from the previously discussed compute job workloads. Thousands of jobs are being launched every day, and your infrastructure needs to track the lifetime of each job - as well as potentially make deployability decisions based on those lifetimes.
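
A sketch of the kind of lifetime tracking involved, with a deliberately blunt upgrade policy. None of this is a real API:

    # Sketch: tracking per-job lifetimes so the deployment infrastructure can
    # make decisions around them (e.g. "don't upgrade while jobs are running").
    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class JobLifetime:
        job_id: str
        started: datetime
        finished: datetime | None = None   # None while the job is still running

    class LifetimeTracker:
        def __init__(self):
            self.jobs: dict[str, JobLifetime] = {}

        def start(self, job_id: str) -> None:
            self.jobs[job_id] = JobLifetime(job_id, started=datetime.utcnow())

        def finish(self, job_id: str) -> None:
            self.jobs[job_id].finished = datetime.utcnow()

        def safe_to_upgrade(self) -> bool:
            # A deliberately blunt policy: only upgrade when nothing is running.
            return all(job.finished is not None for job in self.jobs.values())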

External resources (disks, GPUs, etc)

From time to time, reusable/detachable media is critical. This one is not very surprising, but nonetheless: your infrastructure needs to handle use cases where users need to share access to a resource that is fundamentally not highly available.
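
A sketch of what exclusive access to such a resource might look like - illustrative only:

    # Sketch: serialising access to a resource that cannot be made highly
    # available (a physical GPU, a removable disk).
    import threading

    class ExclusiveResource:
        def __init__(self, name: str):
            self.name = name
            self._lock = threading.Lock()
            self.holder: str | None = None

        def attach(self, service: str) -> bool:
            # Only one service may hold the resource at a time; callers must
            # handle a False return (wait, retry, or degrade gracefully).
            if self._lock.acquire(blocking=False):
                self.holder = service
                return True
            return False

        def detach(self, service: str) -> None:
            if self.holder == service:
                self.holder = None
                self._lock.release()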

Dependency constraints

Wow, your deployments are big and complicated now. You have multiple layers of canaries, clusters of services running temporary compute workloads, tanks.

What happens when you need to make dependency changes that are not backwards compatible? What if they're backwards compatible, but only on a limited subset of versions where a migration can be run?

This is another continuous deployment challenge! Your deployment infrastructure needs to ensure that service upgrades are compatible with other services on a deployment-by-deployment basis. Unless your backend infrastructure is incredibly prescriptive and standardised, there are myriad ways that services can become incompatible with one another.

I recommend two features: dependency constraints (service A version 3.1.0 depends on service B at version 2.14.0 or later), and a generic form of migration identifier (service A 3.1.0 is capable of reading/migrating version-compatibility schemas B and C). This addresses inter-service dependencies and also upgrade range dependencies. Clearly, easier said than done.
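
As a sketch of those two features together, using the example versions above (the API here is invented):

    # Sketch of minimum-version dependency constraints plus
    # migration/compatibility identifiers. Illustrative only.
    from dataclasses import dataclass, field

    def version_at_least(have: str, want: str) -> bool:
        return tuple(map(int, have.split("."))) >= tuple(map(int, want.split(".")))

    @dataclass
    class Release:
        service: str
        version: str
        depends_on: dict[str, str] = field(default_factory=dict)  # service -> min version
        reads_schemas: set[str] = field(default_factory=set)      # migration identifiers

    def deployable(candidate: Release, deployed: dict[str, Release],
                   current_schema: str) -> bool:
        # 1. Every dependency must already be at a compatible version.
        deps_ok = all(
            dep in deployed and version_at_least(deployed[dep].version, min_version)
            for dep, min_version in candidate.depends_on.items()
        )
        # 2. The candidate must be able to read/migrate what's already deployed.
        return deps_ok and current_schema in candidate.reads_schemas

    # The example from the text: A 3.1.0 needs B >= 2.14.0 and understands
    # schemas "B" and "C".
    a = Release("A", "3.1.0", depends_on={"B": "2.14.0"}, reads_schemas={"B", "C"})
    fleet = {"B": Release("B", "2.15.2")}
    assert deployable(a, fleet, current_schema="C")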

The end

The motivation for this post wasn't actually to enumerate all of the ways to make continuous deployment more complex.

I'd like the reader to instead take away that - at least in terms of deployment infrastructure - compute workloads are not much different from regular services. So, if your deployment model plans to support something like compute tasks, I recommend baking those requirements into your CD infrastructure from the outset.