Incident Management: A Proactive Approach to Minimize Disruption
Experiencing an incident can be chaotic, with alarms sounding off and urgent communications flooding in, often at the least opportune moments. The key to effective incident management is not just to respond with urgency but to anticipate and mitigate issues before they escalate.
The TL;DR
To effectively manage incidents, you must first establish visibility across your network and a baseline of its normal behavior. Know what’s normal and what’s not. Ensure robust logs are collected centrally and contain all the details needed for troubleshooting. Develop your response team. Know their strengths and weaknesses, and ensure everyone on the team has skin in the game. Publish playbooks, runbooks, and other procedural documents so everyone knows what is expected of them during an incident. Foster relationships with your peers and other teams long before an incident ever begins. And always strive to improve your incident management process with lessons learned after every incident.
What seems to be the problem?
Most of us probably aren’t incident managers, but I’m sure we can all agree that the best incident is one that never happened. How do we stay proactive and prevent or mitigate outages before they wake us up? You must first establish comprehensive visibility across your network. Understand the baseline of normal operations so deviations can be promptly detected. Implement a monitoring solution that provides real-time insights into your system's health, enabling early detection of potential issues.
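To make "know what’s normal" a little more concrete, here is a minimal sketch of baseline-based deviation detection. It assumes you already collect a numeric metric (the link utilization samples below are made up), and the function name and the three-standard-deviation threshold are my own illustrative choices, not a recommendation for every environment.

```python
from statistics import mean, stdev


def find_deviations(baseline, recent, threshold=3.0):
    """Return (index, value) pairs from `recent` that stray from the baseline.

    A sample is flagged when it sits more than `threshold` standard
    deviations away from the baseline mean.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return [(i, v) for i, v in enumerate(recent) if v != mu]
    return [(i, v) for i, v in enumerate(recent) if abs(v - mu) / sigma > threshold]


# Hypothetical samples: link utilization (%) collected every five minutes.
baseline_utilization = [41, 43, 40, 42, 44, 39, 41, 42, 43, 40]
recent_utilization = [42, 41, 88, 43]

for index, value in find_deviations(baseline_utilization, recent_utilization):
    print(f"sample {index}: {value}% is outside the normal range")
```

A real monitoring platform will do something more sophisticated (seasonality, time of day, and so on), but the principle is the same: you cannot flag the abnormal until you have measured the normal.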
There’s no investigation without evidence
Logging serves a valuable and necessary purpose in troubleshooting, but on its own it provides limited assistance in our quest to be proactive against outages. However, when an outage does occur, logs are crucial. Ensure all critical systems have logging enabled with an appropriate level of verbosity. Logs should be sent to a secure, remote collector to prevent tampering and loss. Centralized logs also give your monitoring something to work with, so this practice supports both early detection and efficient incident response.
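As a rough illustration of shipping logs off the box, the sketch below uses Python's standard `logging.handlers.SysLogHandler` to forward records to a remote syslog collector. The function name, the default port, and the log format are assumptions of mine for the example; your collector's address and required format will differ.

```python
import logging
import logging.handlers


def build_remote_logger(name, collector_host, collector_port=514, level=logging.INFO):
    """Create a logger that forwards records to a remote syslog collector over UDP."""
    logger = logging.getLogger(name)
    logger.setLevel(level)  # verbosity: raise to DEBUG while troubleshooting

    handler = logging.handlers.SysLogHandler(address=(collector_host, collector_port))
    # Include the details a responder needs: timestamp, source, severity, message.
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

You would call it with something like `build_remote_logger("core-router-01", "logs.example.internal")`, where both names are hypothetical. Keep in mind that plain UDP syslog is neither encrypted nor guaranteed to arrive, so the "secure" part of a secure, remote collector usually needs a TLS-capable forwarder in front of it.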
History doesn’t write about one-man armies (except John Rambo)
Many network engineers (myself included) tend to have a hard time handing things off to others when they know they can do it themselves. When we do this, we create a bottleneck where we are the only person able to work on or resolve certain issues. Here are some suggestions to get yourself out of the way and solve problems faster:
- Take a step back - Take a look at your team and assess the strengths. These might be technical skills or soft skills, or their superpower might simply be their time zone.
- Develop your peers - For a long time, we defined our value by the institutional knowledge we hold. That only leads to more late nights and burnout. Bring your peers up to speed so you can work together more effectively.
- Play like it's a team sport - We're all in this together and it is much less stressful knowing you can count on your teammates to get the job done.
This doesn’t reduce job security; it strengthens the team, which in turn improves work/life balance for everyone.
So how can we enable those around us to be successful in our absence? One way is the dreaded “playbook”. Sure, there are always hundreds of variables in any incident that could affect the troubleshooting route you take, and it is completely impractical to think you’ll ever document every possible scenario. But you can identify the issues that occur in your environment on a regular basis (you’re probably already thinking of one or two right now) and easily document the repetitive steps it takes to resolve them. Even the less common, critical outages often require the same initial triage steps. Documenting this initial data gathering in a runbook better prepares the on-call engineer once that page does fly and, in turn, hopefully reduces the time to resolution.
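If it helps to picture what "documenting the initial data gathering" could look like once it is automated, here is a small sketch of a runbook expressed as an ordered list of triage steps. Everything in it (the `TriageStep` class, the `run_triage` function, and the two collectors) is hypothetical and only meant to show the shape of the idea.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TriageStep:
    description: str
    collect: Callable[[], str]  # returns the evidence gathered for this step


def run_triage(steps: List[TriageStep]) -> Dict[str, str]:
    """Run every step in order and record what each one found.

    A failing step is recorded rather than raised, so one broken check
    does not stop the rest of the data gathering.
    """
    findings: Dict[str, str] = {}
    for step in steps:
        try:
            findings[step.description] = step.collect()
        except Exception as error:
            findings[step.description] = f"FAILED: {error}"
    return findings


# Hypothetical collectors; real ones would query monitoring, devices, or tickets.
def check_scope() -> str:
    return "3 sites affected"


def check_recent_changes() -> str:
    raise RuntimeError("change log unreachable")


initial_triage = [
    TriageStep("Determine scope and impact", check_scope),
    TriageStep("Review recent changes", check_recent_changes),
]

for description, result in run_triage(initial_triage).items():
    print(f"{description}: {result}")
```

Recording a failed step instead of aborting is a deliberate choice here: during an outage, a partial picture handed to the on-call engineer is still better than none.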
Know thy resource availability
So now you find yourself in the middle of an outage. The level 1 team has performed all the steps outlined in the playbook and gathered all the details they can. They’ve done a good job of determining the scope and impact of the outage, and believe they have the issue narrowed down to a core router. Now the on-call engineer has been engaged. He reviews all the information gathered by the level 1 team, and trusts in their skills and abilities. He starts investigating at the core router with what appears to be a routing issue, but quickly identifies that the traffic in question does not seem to be passing through the firewall. And now he must engage the firewall team. There’s only one problem. The firewall admin is on vacation and no one else can log into that firewall. Now what?
Have we met?
Cross-functional teams are crucial in the midst of an incident. Even the best playbooks, the most verbose logging, and the strongest team can mean nothing when you’re stopped dead in your tracks needing assistance from another team. What’s the best way to address this? Do you perform cross-functional training across teams? Cross-functional training and access can be beneficial, but should be balanced with security and operational integrity. And when writing the RCA, don’t blame another team for the extended outage. Even if it may be true, that doesn't help during the next outage, nor does it instill customer confidence in your organization. Building relationships with those you may need assistance from is crucial. This must be done long before the incident ever occurs. Having a level of trust and camaraderie with other teams can make all the difference. After all, you’ll net more bears with honey than with homework.
Post-Incident Analysis
There is no such thing as a smooth incident. After resolving the incident, it's crucial to conduct a thorough review of what worked, what didn’t, and what you wish had happened differently. This includes a detailed post-mortem analysis to identify root causes, evaluate the effectiveness of the response, and pinpoint areas for improvement. Document lessons learned and update your playbooks accordingly to strengthen your incident management process.
What do you think? What are your best practices for incident management? What doesn’t work? Add your comments below!
About the author:
Joe Tyler is nobody special who has been responding to support incidents both as a technical resource and as a manager for over 12 years. He has managed incidents in everything from small local government environments to global vendor TAC centers.