This is a copy of the (editorialised) blog post I wrote for the Palantir Medium Blog. I was a founding member of the Incident Response team at Palantir and worked there between August 2021 and June 2022, at which point I changed roles to a software engineer on the maintenance development team with a focus on fusion.
The Palantir Incident Response team addresses the highest-priority issues across our platforms — Foundry, Gotham, and Apollo — ensuring they continue to support mission-critical work around the world. Essentially, the team’s core mandate is to respond when things go wrong. More broadly, Incident Response focuses on business continuity while adapting to an ever-expanding feature set as development teams across Palantir continuously add new capabilities and enhancements.
The Incident Response team is both proactive and reactive, with responsibilities that span project work (around 60%) and incident management (around 40%). Given the breadth of its mandate and the wide variety of issues it's tasked with, no day is exactly the same.
From engineers to analysts, all members of the Incident Response team are focused on optimising Palantir’s incident response capabilities and strategic planning. Team projects often involve developing additional internal tooling to manage issues; reviewing analytics to increase focus and decrease response time; and facilitating issue triage by writing plugins and implementing changes to ticket flows.
On the incident management side, each day can be action-packed as the team responds to the most pressing customer issues. The Incident Response team must be ready to handle any problem that arises, and to work efficiently and effectively across the business, coordinating with stakeholders while navigating documentation, metrics, and software operations interfaces.
In this blog post, Blake, a Palantir Incident Management Engineer based in London, shares a typical day on the Incident Response team.
When I start work in the morning varies a bit depending on my schedule that day. Today I’m on call, so I’m at my desk and ready to go by 9 a.m. The team takes turns being on call: once a week I’m on primary, meaning I’m the first one paged as issues arise, and one other day per week I’m on secondary (i.e., I serve as backup if the primary is at capacity).
Being on call means that at any moment from this point until my shift ends at 3:30 p.m. I could receive a page telling me there’s a time-sensitive, high-impact issue with one of our products. It could be anything from a monitor that was firing but has now recovered to a full outage of a whole cloud provider’s region — every day on call is a bit different.
I may have slightly more time to work on my projects today — but we’ll see!
Our team is close-knit despite the geographical distance between us, and I like to start my day by catching up on messages from my US- and Australia-based teammates. After I’m caught up, I turn to my to-do list. I usually don’t get paged until the afternoon when I’m on secondary, so I’m keen to take advantage of this relatively quiet time to prioritise my tasks and progress my project work. I decide to tackle a code review request from my teammate and a data analytics question from my team lead first.
Code reviews are a priority for me given my skillset — one of my responsibilities on the team is promoting code quality, so I ensure that we complete code changes and prioritise delivery of the most important updates. Not everyone on the team codes, but we all find ways to contribute to the shared goal of being more efficient in the face of a workload that grows along with Palantir’s products and user base. We also like to capitalise on the team’s unique vantage point to drive improvements across the business; our involvement in every high-priority incident means we’re often instrumental in eliminating entire classes of problems by identifying patterns and root causes.
The pull request I’m reviewing is for a Slackbot (a chatbot automation in Slack) written in golang. In this case, we’re looking to expand the scope of issues that this automation covers as well as to enhance our ability to distinguish active incidents from imminent ones. The team has already agreed on the broad direction that this pull request will take via the request for comments (RFC) process, so my feedback is mostly focused on technical details.
Once I’ve wrapped up the code review and sent feedback over to my teammate, I shift my attention over to the data analytics question from my team lead. She’s curious about whether our responsiveness to pages has improved in light of an automation we recently implemented that helps accelerate our initial triage of incidents. I turn to Foundry to answer her question. Foundry is a powerful data engineering tool (among other things), and the Incident Response team uses it extensively to make engineering decisions. For example, we ingest data on our use of Slack and Jira during incident response into Foundry, which allows us to perform analyses and better understand how to optimise our processes.
In this case, I’m able to reuse a Foundry dataset I recently put together. I reply to my team lead quickly and let her know that our responsiveness has improved; we agree that the data shows this improvement is attributable to our new automation.
I get a page for a low-severity incident (my teammate on primary is focusing on a higher-priority issue). An internal monitor is firing because a queue for asynchronous compute batch processing has grown unexpectedly.
By digging into an automatically linked Datadog dashboard, I determine there is a host health issue at the cloud infrastructure layer, which Rubix — our infrastructure layer just on top of the cloud provider — has automatically remediated by killing the unhealthy hosts. This led to a few minutes of increased queuing for the customer’s jobs while our automation cleared the capacity issue.
To wrap this up, I use Datadog to provide a high-signal reference to what happened and to demonstrate that the issue has already resolved itself. I then deescalate the incident using our internally developed Slackbot that integrates with Jira, attaching this context to the ticket. As a final step, I send the ticket over to the Rubix team so they can improve this monitor and avoid this type of unnecessary escalation in the future.
Perfect timing for lunch! My teammates and I head downstairs to the canteen. This is a great time to catch up with colleagues from other teams across Palantir; it’s always interesting to hear about the problems they’re working on and how their approaches are evolving. There’s a buzz in the office, as we’re all excited to be back after a long hiatus during the pandemic.
My phone rings — it’s our paging service, and I know that this is most likely for an incident. I quickly head back upstairs to my desk and confirm that our automation created a Slack channel when this issue was raised as a high-priority incident. One of our bots has already asked the incident reporter — a colleague from a business development team — to add some information that’s missing from the companion Jira ticket. The reporter has answered the prompts in Slack and provided helpful context, so I’m quickly able to start digging into the issue.
It turns out that this Foundry instance is a canary environment, and a Palantir product team has rolled out a release that caused unexpected high-cost Elasticsearch (an open-source data store) queries for a particular integration set up by the customer. I recall the release and wait for Apollo to safely roll back to the previous version.
The reporter and I can see the problematic queries going away as the rollback happens, so I deescalate the issue and send it over to the team that released the now-recalled version. I don’t need to page them because at Palantir, releasing and recalling product versions is a quick, low-cost workflow, and the team responsible for the version is trusted to check the Jira ticket for the incident, gather context, and take appropriate follow-up actions on their own. If the product team needed to un-recall it for some reason, they could do so in just a few clicks.
In the unlikely event the product team needed to signal that this upgrade shouldn’t be recalled, they would have used schema migration semantics to block unsafe downgrades, at which point I’d page them directly so they could be in control of a safe recovery. You can read more about how Apollo provides constraint-based continuous deployment in this recent blog post!
After this incident is resolved, I see that the primary also has no incidents, which means there’s a good chance I’ll have some uninterrupted time now. I book a room so I have a quiet place to focus on some project work. Today I’m refactoring a shared golang module that four to five distinct Slackbots will be able to use, with the ultimate aim of making it easier to roll out future changes and maintain these bots in their steady state.
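The value of a shared module like the one described above is that common plumbing lives in one place while each bot supplies only its own behaviour. As a rough illustration (not Palantir's actual internal code — the interface and bot names here are invented), a shared golang package might expose a small handler interface that every Slackbot implements:

```go
package main

import "fmt"

// Handler is a hypothetical shared interface: each Slackbot implements
// it, and shared concerns (routing, logging, retries) live in one module
// instead of being duplicated across four or five bots.
type Handler interface {
	// Matches reports whether this bot should handle the message.
	Matches(msg string) bool
	// Handle produces the bot's reply to the message.
	Handle(msg string) string
}

// Dispatch routes a message to the first bot that claims it,
// returning false if no bot matched.
func Dispatch(bots []Handler, msg string) (string, bool) {
	for _, b := range bots {
		if b.Matches(msg) {
			return b.Handle(msg), true
		}
	}
	return "", false
}

// EscalationBot is an illustrative bot built on the shared interface.
type EscalationBot struct{}

func (EscalationBot) Matches(msg string) bool   { return msg == "escalate" }
func (EscalationBot) Handle(msg string) string  { return "paging primary on-call" }

func main() {
	reply, ok := Dispatch([]Handler{EscalationBot{}}, "escalate")
	fmt.Println(ok, reply) // true paging primary on-call
}
```

With this shape, rolling out a change to dispatch or logging touches one module rather than every bot, which is the maintenance win the refactor is after.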
My quiet time is over — I get paged for two incidents (the primary is already working on two). While I’m triaging the two, a third arrives as well. This volume of incidents doesn’t happen every day, but it’s something that most of us have seen before: at Palantir, engineers often have to make tough decisions about prioritisation.
I’m able to triage one incident into a “no incident responder” state because the issue was raised by a product team doing their own investigation, and they confirm that no help is immediately needed from me. The Incident Response team is paged for high-impact incidents to ensure that every incident reporter has a partner when resolving an issue.
I triage the second incident into a lower-severity state. To set expectations, I then let this incident reporter know that I’ll be working on two issues, and that the other one is higher severity. As I’m about to shift my focus, the reporter of the lower-severity incident tells me they think they know what’s wrong, and that they’ve already gotten permission to page the product team. That’s good news — I page the product team and set that issue aside for the time being. I can now turn my attention to the other incident.
To communicate more effectively with the incident reporter, I start a video call. It turns out the problem is with our pipeline infrastructure for Python builds, so I decide to page the team responsible for packaging and developing our Python tooling for Foundry (Foundry allows users to write data pipelines in a variety of languages, including Python).
Before I can resolve the incident, it’s the end of my on-call shift; I page the new primary to take over the tickets I’m tracking. Our coverage across time zones and rotation of on-call shifts means I generally feel comfortable with the status of my project work — I know I can reliably find time to make progress on it when I’m not on call.
I spend the next block of time doing a second iteration of code review on the same pull request from earlier in the day. Afterwards, I pair up with a more junior member of my team to help them write some golang unit tests — they had a few questions about interfaces and the difference between pointer and non-pointer method receivers.
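The receiver question that came up in that pairing session is a classic golang stumbling block, and a tiny example makes it concrete: a value receiver operates on a copy of the struct, while a pointer receiver can mutate the caller's value.

```go
package main

import "fmt"

type Counter struct{ n int }

// IncValue has a value receiver: it increments a copy of the
// Counter, so the caller's value is unchanged.
func (c Counter) IncValue() { c.n++ }

// IncPointer has a pointer receiver: it mutates the Counter the
// caller holds.
func (c *Counter) IncPointer() { c.n++ }

func main() {
	var c Counter
	c.IncValue()
	fmt.Println(c.n) // 0 — only the copy was incremented
	c.IncPointer()
	fmt.Println(c.n) // 1 — the pointer receiver mutated c
}
```

This distinction also matters for interfaces: a type's pointer-receiver methods are only in the method set of `*T`, not `T`, which is a common source of confusing compile errors in unit tests.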
This type of side-by-side knowledge sharing happens pretty frequently on the team, as we’re focused on ensuring that everyone is able to grow both their technical and non-technical skills. We also have formal and informal mentorship networks to encourage all our team members to get exposure to a variety of perspectives across Palantir.
It’s time to shift away from my project work and over to my meeting block. I usually have meetings during the afternoon, when my US teammates on the West Coast start their workday. Palantirians often designate particular parts of the day for meetings or focus time, which makes it fairly easy to find time to meet without interrupting the flow of a colleague’s day.
After a weekly planning meeting with my project team, I join a video call with stakeholders from one of Palantir’s business development teams. Today, we’re discussing the way we handle and measure lower-priority incidents.
More specifically, we’re working on a way to address some of the priority inflation we’ve observed. The goal here is to balance getting business stakeholders the help they need as quickly as possible with our team’s bandwidth constraints and the overarching need to ensure we don’t miss high-priority incidents. At Palantir, we often begin meetings like this by building consensus on what the problem is — this helps us empathise with each other and iterate on potential solutions.
I finish every day around 6 p.m. as a way to keep a balance. The constant influx of information as other time zones wake up can make it tempting to keep working, but I’m able to make time for other things knowing that my teammates will handle any incoming requests while I’m offline. This evening, I’m joining a few others from the London Incident Response team to go bouldering, which is somewhat popular at Palantir. I personally like that bouldering lets me combine socialising with a bit of exercise.
Palantir’s Incident Response team has a business-critical responsibility to help resolve some of the most impactful product issues while continually innovating as our business and offerings grow. We’re looking for new teammates who excel in software operations, SRE/DevOps-style tool development, and more. We value a growth mindset, as well as an eye towards problem solving and process improvement: we’re always focused on finding the root causes of issues and working to make our platforms better.