
Rethinking stability

Posted 2020-08-05

This is about my role at work, and how I spend my time. I find that I am eager to participate, and overly motivated by helping people rather than helping myself. I find myself tracking 4-5 large workstreams at any given time.

I have reason to work less, and get more done. I want to introspect, and also plan concretely what I should focus on for the remainder of 2020. In addition to my core responsibilities, I plan to limit myself to 1-2 areas of focus this quarter.

Projects

I'm presently a Product Reliability Engineer. This is something like a cross between a support engineer and an SRE, with a specific focus on the stability of a few interrelated microservices.

My primary goal is only indirectly related to my day-to-day work: I define my own success by how stable the critical software under my scrutiny is - but my core work responsibilities are support queue ownership and incident response. These two work items provide a decent pulse for monitoring product health - but to excel, we must think and act more broadly.

Discrete improvements in product stability require concerted effort - we call these projects. Defining and investing in a project requires the same scrutiny one would expect from any other engineering role - clear motivation, scope, success criteria, RFC, etc.

As I think about these projects, I invent an explanation for my approach in defining one: a project is spawned from the intersection of (Q, E, O, K) - that's Q quarterly focus, E evidence of a specific shortcoming, O observed (mis)use of operational systems, and K pre-existing knowledge/expertise.

So, in evaluating projects, I'll think about QEOK. I'll use this as a tool for prioritising project candidates, evaluating my own strengths and weaknesses, and rationalising how past projects resolved.
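To make QEOK a little more concrete, here is a minimal sketch in Python (with made-up candidate names and 0-3 scores per axis - none of these reflect real projects) of using the four inputs to rank project candidates. It reads "intersection" literally by taking the minimum across the axes, so a candidate that is weak on any one input ranks low; a weighted sum would be a gentler alternative.

    from dataclasses import dataclass

    @dataclass
    class Candidate:
        name: str
        q: int  # Q: alignment with the quarterly focus
        e: int  # E: evidence of a specific shortcoming
        o: int  # O: observed (mis)use of operational systems
        k: int  # K: pre-existing knowledge/expertise

        def score(self) -> int:
            # "Intersection" read literally: a candidate weak on any one
            # axis shouldn't rank highly, so take the minimum.
            return min(self.q, self.e, self.o, self.k)

    # Hypothetical candidates with made-up scores.
    candidates = [
        Candidate("tune health checks", q=2, e=3, o=1, k=1),
        Candidate("regression tests for a flaky service", q=3, e=2, o=2, k=3),
    ]

    for c in sorted(candidates, key=Candidate.score, reverse=True):
        print(f"{c.name}: {c.score()}")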

Naïveté

As I consider my historic approach to work balance and projects, I imagine myself on ice skates. I move fast - and agile. I have direction. Others are impressed by my work. What I am doing can be taught and shared.

I'm also bound to the rink. I'm not uniquely talented at skating. When I slip, it hurts. And in spite of being decent at skating, in the back of my mind I know that I prefer summer sports.

To replace skates with stability: in a past life I was a product developer for actually stable software. There are many differences between that life and this, but the ones that stick in my mind are:

  • Actually testing software. Automated checks do not count.
  • A strong sense of responsibility for making development decisions based on safety.
  • Actually reviewing code changes.
  • Architecting systems to be low-risk by default.

In this past life, being agile was something won over bouts of fisticuffs. Each new, modern practice gained required a conversation about trade-offs, acceptable risk, and scrutiny from people who cared about reliability: ops engineers, support engineers, product managers, and test engineers. In practice, this sometimes looked like: we can do x, but must continue also doing y.

I got used to doing y sometimes. I saw the value. I felt responsible. I leaned on it as a safety net.

Today my life is a stark contrast. My naïveté is now apparent. The summer sport I dreamed of will not come for free.

To make the analogy clearer: an organisation can care about - and invest in - stability, but without extraordinary impetus, no one is going to wake up and decide to hire a team of software testers. Especially not an organisation with powerful and successful Continuous Delivery systems.

The motivation to ask the stability person about improving stability is too weak. The motivation to satisfy the new-feature itch is too great.

I must recognise that I am in the minority for caring about writing bug-free software.

Posturing

I'm afraid that part of why I haven't enacted the grand change I dream of, nor been satisfied by the stability wins I have made, may be an Us Versus Them mentality. I have a team, I feel that I'm part of that team, I learn a lot, and I value my team's heterogeneous expertise - but nevertheless I am plagued by the feeling that I could - and would - have written the same software, just more stable.

One way to improve this is with a bit of posturing.

  • Instead of drawing conclusions from specific experiences writing stable software, think about how those practices are applicable to my current world.
  • Instead of treating Continuous Delivery as a safety net, think about how we can make better use of CD.
  • Instead of thinking about shortcomings in (language, framework, style), think about how to maximise stability gains with the smallest justifiable incremental improvement - drawing from a pool of knowledge, rather than a photo of perfection.

Another way to think about it is that I must flip my expectation on its side. Rather than two worlds where one world should be made to look more like the second, I have this world and some knowledge/experience that can help make decisions based on a view of another world through some lens.

Just a bit selfish, if I'm honest

Part of why I sometimes lean towards my old methodologies, or lean away from depending too heavily on some operational tooling, is perhaps some distrust towards that tooling's capability to provide categorical solutions to problems. For example, I have an unstable microservice whose problems can't be solved by health checks.

Another part of me, though, just doesn't want to develop operational expertise, because it's not my interest or career path. It's not hard to reach for reasonable business justification:

  • Product Reliability doesn't require operational expertise (I have none) or operational responsibility (I hold none)
  • Using operational tooling can reduce pain/impact, but doesn't improve stability from one release to the next (bugs are still present)
  • Operational tooling is not homogeneous across deployments, so improvements provide disproportionate value for mainline users

I should suppress the ego a bit, and spend more time using these systems. I should look for areas where our use of these systems can be improved.

Local optima

It's getting late, so I'll keep this section simple. If I recognise that an issue can be resolved by - say - testing software using sane and efficient testing practices, that doesn't mean the right way to improve stability is to simply start a testing regime.

I should instead evaluate specific stability issues, and consider the sharpest tool to improve each instability, rather than reaching for a broad solution by default.

Conclusions

All sections above discuss shortcomings of pre-existing knowledge (K). Operational systems (O) come up repeatedly as a valuable focus area. I would be wise to invest more in that area.

Appendix: past projects

In the past I focused on projects that improved or fixed specific issues (E). These changes are good for burning down support burden.

These changes are also more satisfying (actually working on the product), but in truth, they don't scratch the software development itch. There are also diminishing returns - as more work is invested in this area, eventually the remaining bugs are only minor ones.

My advice is to focus heavily on (E) for unstable software, and re-evaluate that prioritisation once successive improvements begin to feel obscure.