A few years ago at Arctic Wolf I put together a talk titled “How to be on-call”, in response to the rapid growth of the organization and increasing number of on-call schedules. The talk turned out to be very popular and the recording became part of the onboarding process for new employees. While some of the talk was company-specific, much of it is applicable to the broader software industry, and I’ll share some of the highlights here.
I have been on-call for most of my career and have led teams with on-call rotations, so I have a lot of experience with the negative impact of on-call on my personal life and the lives of my colleagues. I’ve missed Christmas dinner (years later my Mom still brings it up), worked through weekends and nights, missed many kids’ events, and once juggled a fussy baby and an incident call at the same time. My goal is to make being on-call as sane as possible, balancing what the business needs with our collective personal lives.
“#oncallsucks” was trending circa 2020, and in response Charity Majors summarized the on-call responsibilities in On Call Shouldn’t Suck: A Guide for Managers: “It is engineering’s responsibility to be on call and own their code. It is management’s responsibility to make sure that on call does not suck. This is a handshake, it goes both ways, and if you do not hold up your end they should quit and leave you.”
A few ways that #oncallsucks:
- Alert Fatigue — there are still too many horror stories of alerts constantly going off that are not actionable. Dan Ravenstone had a great talk at Monitorama “Thinking Critically About Alerting” about how to deal with this.
- Onblame Oncall — quoted from a colleague — quite the opposite of blameless, where the on-call responder is blamed for whatever broke.
- Always On — this was the culture at a former company, and you were always on-call even when on vacation. This was horrible in so many ways and led to quick burnout and high turnover.
- There Can Only Be One — you are the only person on-call, there’s no escalation process to get help, and no team to back you up.
On-call does not have to suck, but it’s hard work for everyone to remove the suck and even harder to continually keep it from sucking.
My original talk focused heavily on the individual, but here I’m going to focus more on how a team can support each other, and on the responsibilities of leaders.
The first question to ask, especially as a leader, is: do we need an on-call rotation at all? If your service has an actual SLA (with penalties, etc.), is a true global service, or supports, say, a 24x7 manufacturing process, the answer is likely yes. There are many cases where the easy answer is yes, but under scrutiny that may not hold up. I have two examples where a 24-hour schedule could be reduced:
- If the AWS Trust and Safety team contacts you about potential abuse on your account, you must respond within twenty-four hours. If that happens, say, early on a Saturday morning, the response cannot wait until Monday, but it also doesn’t mean you need to wake someone up. In that particular scenario we configured the paging software to only call out during daytime hours, seven days a week.
- In the case of office network infrastructure, the temptation is to set up a 24x7 alerting process. In many cases a blip in the night is acceptable as long as the network is functioning by the time the first individuals arrive on site. Some creativity with scheduling means the IT staff can sleep most of the night before getting called out.
Don’t take it as gospel truth that you actually need a 24x7 schedule; be creative and reduce the impact on the on-call staff if at all possible.
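The daytime-only callout idea above can be sketched as a simple gate in front of whatever actually pages a human. This is a hedged illustration, not any vendor’s API: the function name and the 08:00–22:00 window are assumptions you would tune to your own response-time obligations.

```python
from datetime import datetime, time

# Hypothetical daytime window: page immediately between 08:00 and 22:00
# local time, any day of the week; otherwise hold the alert until morning.
DAY_START = time(8, 0)
DAY_END = time(22, 0)

def should_page_now(now: datetime) -> bool:
    """Return True if a non-urgent alert should page a human right now."""
    return DAY_START <= now.time() < DAY_END

# A 3 AM Saturday alert gets queued for the morning, not paged:
print(should_page_now(datetime(2024, 6, 1, 3, 0)))   # False
print(should_page_now(datetime(2024, 6, 1, 9, 30)))  # True
```

Most paging tools can express this natively with support-hours or schedule restrictions; the point is that the urgency of the response, not the arrival time of the alert, should decide whether someone wakes up.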
So you are on-call…
Before you start your on-call rotation, it should be obvious that you should know how you’re going to be called, what you might get called for, and expectations on how to respond and escalate. I say it should be obvious, but it’s shocking how frequently this is all taken for granted.
At a minimum you should know:
- How am I going to be called — PagerDuty or VictorOps or X? Any setup required?
- Who gets to wake me up?
- Am I on the escalation path for anything?
- What are the expectations on response time when I do get called?
- Where are the automated alerts defined, and where are the runbooks? (There are runbooks, right?)
- How do I escalate to get help?
When you do get called, Don’t Panic! In my first few calls in the late ’90s I fought panic, because I didn’t have answers to the questions above and had no real escalation path. If you are the first responder, your job is to triage the alert and determine next steps. Those steps are highly contextual to your situation, but should always include the question “Was it worth waking up for this alert?”; if the answer is no, tune the alert the next day.
Take care of yourself. Waking up in the middle of the night for a call always wrecks me — the adrenaline boost of the call means I won’t go back to sleep — anecdotally others struggle with the same. Do not hesitate to ask the team to cover the next night so you can get a night’s sleep.
Your team is on-call…
The best way for an on-call shift to be less stressful for me is to know that my team has my back, that I’m not in it alone. When I’m not on-call, the responsibility of supporting my colleagues falls on me. What does that actually mean in practice? A few suggestions:
- Write good runbooks. At Arctic Wolf alerts were defined in a GitHub repository, and runbooks were attached and enforced at commit time — you could not commit an alert without a runbook. This brought some discipline to the process of creating alerts, encouraged careful thought on how to debug situations, and meant that when called in the middle of the night you had a really good starting point. Put a slightly different way — if you can’t write a runbook which contains actions on how to resolve the alert, the alert shouldn’t exist. There are a lot of great resources on how to write runbooks, this article is a great example.
- Keep an eye on the on-call team member’s workload. If they’ve been up all night, organize a swap so they can take the next night off. Some individuals (like me) have a tough time asking for help; recognizing that tendency in others, I’ve sometimes had to almost forcibly take someone off a schedule to give them a break. The team’s leader can do this, but it’s best if the team organically takes care of each other.
- Fostering a blameless culture means creating a psychological safety net, which is incredibly important when responding to incidents, especially when there’s significant pressure and a legitimate fear of “what if I make it worse”. I wrote an entire article on that: A few words about blameless culture.
- Create a good on-call onboarding process that answers the questions in the “So you are on-call…” section above. The first few calls can be extremely stressful. I’ve seen successful onboarding include call shadowing, where a new team member is on the same rotation as a more senior member and they answer calls together for a few weeks.
The short summary: take care of each other, treat your team members as you want to be treated.
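The commit-time runbook enforcement described above can be approximated with a small pre-commit or CI check. This is only a sketch of the idea, not Arctic Wolf’s actual tooling: the `alerts/*.yml` layout and the `runbook:` field are hypothetical, and a real check would parse the YAML properly rather than scan lines.

```python
import sys
from pathlib import Path

def alerts_missing_runbooks(alerts_dir: str) -> list[str]:
    """Return alert definition files that lack a non-empty 'runbook:' field."""
    missing = []
    for path in sorted(Path(alerts_dir).glob("*.yml")):
        has_runbook = any(
            line.strip().startswith("runbook:") and line.split(":", 1)[1].strip()
            for line in path.read_text().splitlines()
        )
        if not has_runbook:
            missing.append(path.name)
    return missing

if __name__ == "__main__":
    alerts_dir = "alerts"  # hypothetical repo layout: alerts/*.yml
    if Path(alerts_dir).is_dir():
        bad = alerts_missing_runbooks(alerts_dir)
        if bad:
            print("Alerts missing runbooks:", ", ".join(bad))
            sys.exit(1)  # fail the commit / CI job
```

Failing the build is the whole point: it turns “please write runbooks” from a plea into a mechanical gate, which is what creates the discipline described above.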
You are a leader with an on-call team…
The best, albeit most controversial, way to support your on-call team is to pay them for the extra time. You are asking your team to provide 168 hours of coverage per week, impacting their personal lives, their nights, and their weekends, and you are adding significant stress to their jobs. Paying an employee for that impact sends a strong, clear message that you recognize and appreciate the extra hours. There are many ways to do this, and the actual amount may be affected by the specific country’s legislation, but my favorite was eight hours of pay for a 24x7 shift and a minimum of two hours of overtime for a callout.
The added benefit of paying for on-call is that you bring visibility of the cost of running the service to the business, with a budget line item containing that cost. The rest of the business understandably doesn’t know what it takes to be on-call, but they certainly grok budgets, and having a common language to discuss the actual cost becomes very important when planning for hiring, discussing new services or products that will require 24x7 support, and setting SLAs for products.
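To make the budget line item concrete, here is the arithmetic for my favorite scheme above: eight hours of straight time per 24x7 shift, plus overtime for each callout with a two-hour minimum. The $50/hour base rate and the 1.5x overtime multiplier are illustrative assumptions; the 8-hour and 2-hour figures are the ones from the text.

```python
def oncall_pay(hourly_rate: float, shifts: int, callout_hours: list[float]) -> float:
    """On-call pay: 8 hours of straight time per 24x7 shift, plus
    overtime (assumed 1.5x) per callout with a 2-hour minimum."""
    shift_pay = shifts * 8 * hourly_rate
    overtime = sum(max(h, 2.0) for h in callout_hours) * 1.5 * hourly_rate
    return shift_pay + overtime

# One weekly shift at a hypothetical $50/h, with a 30-minute callout
# (billed as the 2-hour minimum) and a 3-hour callout:
# 8*50 + (2.0 + 3.0) * 1.5 * 50 = 400 + 375
print(oncall_pay(50.0, 1, [0.5, 3.0]))  # 775.0
```

Multiply that by shifts per year and you have the budget line that lets the rest of the business reason about what 24x7 support actually costs.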
A leader must measure the number of callouts, and take immediate action if the number starts to increase. I have seen three common reasons why the number of callouts can increase:
- Scale — my previous two organizations each scaled more than 2x per year for multiple years in a row, and at some point a service would naturally start hitting scaling limits. While we tracked a myriad of metrics covering the four golden signals, an increased callout rate was definitive proof that the service was suffering and that new feature development needed to stop so the team could address scaling.
- Services change over time, and alerts need tuning. Alert fatigue, where an alert firing frequently is seen as “normal” and then ignored, is a real thing and should be squashed as fast as possible. Tune those alerts!
- Along with rapid development comes the possibility of a service accidentally becoming less resilient. This can easily happen even to a service that isn’t scaling and taking on extra load, and needs to be addressed.
How you measure and track this will depend on your org, but I highly recommend your team has visibility into your tracking and decision making process so they can see that you are looking out for them, and the actions you take to slow the callout rate will speak volumes.
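The measurement itself does not need to be fancy. As a sketch of the idea (the four-week window and the 50% threshold are arbitrary starting points I made up, not a standard), bucket callouts by week and flag when the recent average jumps relative to the prior period:

```python
from collections import Counter
from datetime import date

def weekly_callout_counts(callouts: list[date]) -> dict[tuple[int, int], int]:
    """Bucket callout dates by ISO (year, week)."""
    return dict(Counter(d.isocalendar()[:2] for d in callouts))

def is_trending_up(weekly_counts: list[int], window: int = 4) -> bool:
    """Flag when the recent window's average callout rate exceeds the
    prior window's average by more than 50% (both thresholds arbitrary)."""
    if len(weekly_counts) < 2 * window:
        return False  # not enough history to compare
    prior = sum(weekly_counts[-2 * window:-window]) / window
    recent = sum(weekly_counts[-window:]) / window
    return recent > prior * 1.5

# Averages jump from 1.5 to 4.0 callouts/week: time to act.
print(is_trending_up([2, 1, 2, 1, 4, 5, 3, 4]))  # True
```

Whatever the mechanics, review the numbers with the team in the open: the trend line is the trigger for the conversation about pausing feature work.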
A team needs to know that their leader has their backs and supports them. Again that’s pretty obvious, but a few ways you can do this:
- Run interference for them. In a previous job I had an awesome manager who would stand in the path leading to our cluster of cubicles, preventing irate users and other leaders from interrupting us while we were resolving an incident. The modern equivalent of someone rushing your desk is an out-of-band question in Slack/Teams/Zoom/etc sent directly to a team member, and your team needs to feel comfortable escalating those directly to you instead of trying to answer them.
- Create a good escalation process. PagerDuty’s internal policy, documented here, is “Never Hesitate to Escalate” — which first means your team needs someone to escalate to, and then a Blameless way to do so without any fear of retribution regardless of the time of day.
- I strongly recommend not scheduling project work for the on-call team member for the duration of their shift, and building that into your project planning and estimation process. The individual can spend their time on bug hunts, research, or annoyance-driven development, and be ready for interrupt-driven work. A few of my teams implemented this very successfully; it takes a lot of pressure off the individual and removes timeline risk from planned project work.
- If a response team is physically in the office, send in food for lunch or dinner, or if the team is remote, hand out coupons to food delivery services where possible. This seems like a small gesture, but I fondly remember running large incidents out of a conference room over the dinner hour, and a stack of pizzas appearing without warning, sending a strong signal that our leaders were watching out for us.
One of the harder things I’ve had to do with on-call teams is formalize response times and expectations. With smaller teams it’s easy to build trust and have an informal “best effort” policy, but as teams grow, so does the expectation of a more formal policy. I know firsthand the impact that on-call has on personal lives and the difficulty of adhering to any policy. That said, these are the guidelines I’ve recommended in the past:
- Be able to acknowledge an alert within 15 minutes.
- Make a best effort to get online within 30 minutes. If, when you acknowledge the alert, you know you can’t hit the 30-minute response time, escalate. I had a long commute and would frequently either find a local coffee shop or escalate and join the incident call when I arrived home.
- This can be a thorny one fraught with legal issues — be sober enough to be able to answer the call and work the problem to resolution. My personal rule of thumb is when on-call I should be legally able to drive, but that of course will depend on your local laws.
- You cannot be on vacation (PTO) and on-call. I was surprised I had to even state this but ran into it a few times. Taking time off is incompatible with being on-call, that’s not negotiable.
The last way to make sure your on-call rotation is healthy is to have enough people on it. The absolute minimum for a 24x7 rotation, assuming weekly shifts, is four individuals. Any fewer than that and you will burn out your team, and I’m hesitant to even give the number four because in most cases it’s too low. Being on-call one week out of four puts a significant burden on a team member and will cause them to leave. Having six people is better, and eight is good. If your team has fewer than four (or even six), join forces with another team and share a rotation until you can grow the teams. A few of my teams did that as they grew and split, and the positive side effect was that the teams wrote really good runbooks to support each other.
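The burden of a small rotation is easy to quantify: with weekly shifts, each person covers 52/N weeks a year. A quick sketch of that arithmetic:

```python
def oncall_burden(team_size: int) -> tuple[float, float]:
    """For a weekly 24x7 rotation: weeks on-call per person per year,
    and the fraction of all weeks each person spends on-call."""
    weeks_per_year = 52 / team_size
    fraction = 1 / team_size
    return weeks_per_year, fraction

for n in (4, 6, 8):
    weeks, frac = oncall_burden(n)
    print(f"{n} people: on-call {weeks:.1f} weeks/year ({frac:.1%} of the time)")
```

Thirteen weeks a year on-call (a four-person rotation) is a quarter of someone’s life spent tethered to a pager; at eight people it drops to six and a half weeks, which is why the larger numbers matter.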
I’m convinced that on-call doesn’t have to suck, and while getting called in the middle of the night will never be fun, a good on-call culture can make it bearable.
- Charity Majors’s article “On Call Shouldn’t Suck: A Guide for Managers”
- Dan Ravenstone’s talk at Monitorama PDX 2023: Thinking Critically About Alerting
- Writing good runbooks: “Runbooks: An On-Call Person’s Best Friend”
- A few words about Blameless culture
- The Four Golden Signals is a section in the Monitoring Distributed Systems chapter of Google’s SRE Book.
- PagerDuty’s docs “Best Practices for On Call Teams” are excellent.
- This is a great article on the mechanics of scheduling: How to build an Effective and Sustainable On-Call Schedule For Your Team