blakehawkins.com

Engineering sleep

Posted 2023-03-05

It has taken a long time for me to appreciate sleep, and it's still something I struggle with.

A software operations lesson

To help understand how and why I "engineered" my sleep, I'll first explain some software operations learnings.

Note: the lessons below apply to product-centric software development, as opposed to low-latency or heavily engineered software. One assumption I'm making throughout is that, for product-centric software, the P95 latency for user actions and endpoints should actually be much worse than the median (claim: if it isn't, then the software could be automating more or stretching its product goal further).

I used to be a kind of software operator. In that role, I spent a lot of time thinking about how to make software services more reliable. One of the things I learned was how to think systematically at an organisation level -- how to influence people and teams to care about or focus on specific things, to impact software reliability with as wide a footprint as possible. I would call this my first experience in high-level leadership.

When I talk about this systematic engineering of people and teams, I'm not talking about hierarchical demands like documenting code quality standards, or enforcing unit-testing. Rather, I'm talking about how to be a voice that others want to listen to for their own benefit. I think of this as "unenforced leadership". It's about building trust, making your input high ROI, focusing on problems, and using higher order thinking to identify and suggest non-obvious solutions.

And I wasn't the only operator who made these kinds of observations or had that kind of influence. Software operators weren't the only engineers who used this approach to have an impact, either. Product leadership, product architects, and respected developers all used this same framing to nudge the product towards "good".

The company more generally lived and breathed operational excellence. Where other organisations used maxims like "move fast and break things" to build software fast, we used operational excellence to fight back the deluge of challenges without losing the north star of high-reaching product goals.

I would argue that this was the secret that enabled the company to make high-risk investments and pour resources into going deep on design or product goals without falling apart at the seams.

In other words, this became a tool to free developers to focus on product rather than engineering, and also to make better decisions about when to invest in high-impact reliability improvements versus just building.

All of this is to add to the weight of my next statement -- the most valuable operational lesson I learned at an operationally excellent company was a specific and technical one:

Always engineer to improve a P95 or P99 (or in rare cases, P999).

The subtext is deeper and more nuanced:

  • The maximum (Pmax) is frequently outlandish due to rare events like network faults or OS scheduling anomalies.
  • The median or mean feels right because those values usually represent a "typical" case, statistically speaking.
  • The P9x is right because it represents the worst case that a typical user will have experienced, and because in engineering problems, most of a user's lost time will have been spent in a bottleneck's P95.
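To make the three bullets concrete, here is a minimal sketch in Python. The latency numbers are invented for illustration, and `percentile` is a hypothetical helper using the simple nearest-rank definition; a real monitoring stack may interpolate differently.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical latencies (ms) for 100 user actions: mostly fast,
# a slow tail, and one outlandish outlier (say, a network fault).
latencies_ms = [100] * 90 + [120] * 4 + [900] * 5 + [30000]

print("median:", percentile(latencies_ms, 50))   # 100
print("mean:  ", statistics.mean(latencies_ms))  # about 440, dragged up by the outlier
print("P95:   ", percentile(latencies_ms, 95))   # 900
print("P99:   ", percentile(latencies_ms, 99))   # 900
print("Pmax:  ", max(latencies_ms))              # 30000
```

In this made-up data the median says everything is fine, the maximum points at a one-off fault, and the P95 points at the slow tail that users actually sit through.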

To repeat: if you want to improve the average user's experience, you have to improve the P9x because each user will have themselves experienced the P9x.
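A quick back-of-the-envelope check of that claim, as a sketch. It assumes each action's latency is independent of the others, which is a simplification (real slow requests tend to cluster), and `chance_of_hitting_tail` is a name made up for this example.

```python
def chance_of_hitting_tail(actions, percentile=95):
    """Probability that at least one of `actions` independent actions
    is slower than the given percentile."""
    if actions < 0:
        raise ValueError("actions must be non-negative")
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be in [0, 100]")
    return 1 - (percentile / 100) ** actions


for n in (1, 20, 100):
    print(f"{n:>3} actions: {chance_of_hitting_tail(n):.0%} chance of at least one P95+ experience")
```

Under that independence assumption, a user who performs 20 actions has roughly a 64% chance of having hit the P95 or worse at least once, and after 100 actions it is close to certain.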

In practice, you should apply this maxim by:

  1. Measuring. (If you're not measuring, you're reading the wrong blog post, and in need of remedial education.)
  2. Determining what you're optimising based on end-to-end measurement or user experience analysis.
  3. Optimising specific flows based on insights. Here I assume that you've already found that some particular computational workflow is the thing to optimise.
  4. Selecting P9x samples from within that flow and then improving them.

The last step might feel unnatural. Again:

  • You shouldn't pick the slowest sample because it frequently won't be a representative case for a typical user's experience.
  • You shouldn't pick an average sample because it will lead you to improving a happy path rather than real users' real bottlenecks.
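Step 4 can be sketched as follows. This is only an illustration: the `(trace_id, duration_ms)` tuple shape, the `p9x_samples` helper, and the default P95 to P99 band are all assumptions for the example, not a description of any particular tracing tool.

```python
import math


def nearest_rank(ordered, p):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def p9x_samples(samples, low=95, high=99):
    """Return the (trace_id, duration_ms) samples whose duration falls
    between the `low` and `high` percentiles, inclusive."""
    if not samples:
        return []
    durations = sorted(duration for _, duration in samples)
    lo = nearest_rank(durations, low)
    hi = nearest_rank(durations, high)
    return [s for s in samples if lo <= s[1] <= hi]


# Hypothetical traces for one flow: durations of 1..100 ms.
traces = [(f"trace-{i}", i) for i in range(1, 101)]

for trace_id, duration_ms in p9x_samples(traces):
    print(trace_id, duration_ms)
```

With this toy data the selection returns the traces with durations 95 to 99: it skips both the single slowest trace and everything near the median, which is exactly the band worth opening up and reading.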

There's a bonus that is often overlooked. When you optimise the P9x, the median normally improves too.

Engineering myself to sleep more

I applied these principles to my own life. I wanted to ratchet up my P95 sleep duration. Instead of sleeping in longer, or setting a later alarm, or going to bed earlier, I set a series of progressive rules, hard-limiting myself to "no later than" times to go to sleep.

I started with 0200 in 2021, then adjusted this to 0100 in 2022, then finally midnight in 2023.

I'm not trying to affect my absolute max latest bedtime, because I don't mind occasional exceptions when I'm jet-lagged, or on holiday.

But I am engineering my P99 by ~never going to bed later than the time specified in the rule. In 2023 I've never gone to sleep after midnight.
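Here is a toy model of what the rule does to the distribution. Everything in it is invented for illustration: the bedtimes are not my real data, they are expressed as minutes after 20:00 to avoid wrapping around midnight, and `apply_rule` is a hypothetical helper that clips every night to the limit except for a fixed number of exceptions.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def apply_rule(bedtimes, limit, exceptions=0):
    """Clip bedtimes to `limit`, leaving the latest `exceptions` nights untouched
    (jet lag, holidays)."""
    ordered = sorted(bedtimes)
    keep = len(ordered) - exceptions
    return [min(b, limit) for b in ordered[:keep]] + ordered[keep:]


MIDNIGHT = 240  # minutes after 20:00

# Hypothetical bedtimes for 20 nights, in minutes after 20:00.
before = [210] * 10 + [250] * 5 + [300] * 3 + [380, 450]
after = apply_rule(before, limit=MIDNIGHT, exceptions=1)

for label, nights in (("before", before), ("after", after)):
    print(label, "median:", percentile(nights, 50),
          "P95:", percentile(nights, 95), "max:", max(nights))
```

In this toy data the P95 bedtime moves from 380 (02:20) to 240 (midnight), while the maximum stays at 450 because the one exception is left alone. The clipping alone leaves the median where it was; any movement in the median comes from the habit change, which this sketch does not model.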

Has it affected my median sleep? Definitely! I never sleep less than I did in 2022, and I ~always sleep more.

Has it affected my max sleep? Yes, that too! Since I'm used to going to bed before midnight, I'm more frequently ready to sleep at 2345, 2330, or even 2300.

TL;DR

I set a rule for myself to ~never go to bed later than midnight. I ratcheted the limit down from 0200 over the last few years. I believe this rule works especially well because it optimises for the P9x of sleep duration.

There's more to optimise (diet, exercise, entertainment, money, etc.) and I plan to use this approach in more areas in future.