# How to Build an Effective Incident Response Playbook
Build and test incident response playbooks for your team. Covers severity classification, communication templates, war room procedures, and post-mortem frameworks.
Organizations with tested incident response plans save an average of $2.66M per breach compared to those without. The plan isn’t the document — it’s the rehearsal.
## Step 1: Severity Classification
| Severity | Definition | Response Time | Example |
|---|---|---|---|
| SEV-1 (Critical) | Revenue-impacting, data breach, full outage | 15 min | Payment system down, data exfiltrated |
| SEV-2 (High) | Partial outage, degraded performance | 30 min | API error rate > 10%, one region down |
| SEV-3 (Medium) | Minor impact, workaround exists | 2 hours | Slow queries, non-critical service down |
| SEV-4 (Low) | No user impact, internal only | Next business day | Monitoring alert, log anomaly |
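A matrix like this is most useful when tooling applies it consistently, so every responder lands on the same tier. A minimal Python sketch of the classification logic above; the signal names (`partial_outage`, `error_rate`, and so on) are hypothetical placeholders for whatever your telemetry actually exposes:

```python
def classify_severity(*, revenue_impacting=False, data_breach=False,
                      full_outage=False, partial_outage=False,
                      error_rate=0.0, user_impact=False,
                      workaround_exists=False) -> str:
    """Map incident signals to a SEV tier per the matrix above."""
    if revenue_impacting or data_breach or full_outage:
        return "SEV-1"  # page immediately; 15-minute response target
    if partial_outage or error_rate > 0.10:
        return "SEV-2"  # 30-minute response target
    if user_impact or workaround_exists:
        return "SEV-3"  # minor impact; 2-hour response target
    return "SEV-4"      # internal only; next business day
```

Checking the worst condition first means a data breach is never downgraded just because a workaround exists for some other symptom.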
## Step 2: On-Call Structure
```yaml
# PagerDuty schedule structure
on_call_rotation:
  primary:
    schedule: weekly_rotation
  escalation:
    - wait: 5_minutes
      target: secondary_on_call
    - wait: 10_minutes
      target: engineering_manager
    - wait: 15_minutes
      target: vp_engineering
  incident_commander_pool:
    - senior_engineer_1
    - senior_engineer_2
    - engineering_manager_1
    - engineering_manager_2

roles:
  incident_commander: "Coordinates response, makes decisions"
  communications_lead: "Updates stakeholders, customers"
  technical_lead: "Directs debugging and remediation"
  scribe: "Documents timeline and decisions"
```
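The escalation chain behaves like a cumulative timer: while the page sits unacknowledged, it walks down the list. A small Python sketch of that logic, mirroring the YAML above; treating each `wait` as stacking on the previous step is an assumption for illustration, not PagerDuty's actual API:

```python
# Mirrors the escalation chain in the YAML above. Each wait is treated as
# stacking on top of the previous step (an assumption, not PagerDuty's API).
ESCALATION = [
    (5, "secondary_on_call"),
    (10, "engineering_manager"),
    (15, "vp_engineering"),
]

def current_target(minutes_unacknowledged: float) -> str:
    """Who should be holding the page after this many unacknowledged minutes."""
    target = "primary_on_call"
    elapsed = 0
    for wait, next_target in ESCALATION:
        elapsed += wait  # cumulative minutes since the alert fired
        if minutes_unacknowledged >= elapsed:
            target = next_target
    return target
```

Under this reading, the VP is paged only after an alert has gone 30 minutes without acknowledgement, which is worth sanity-checking against your SEV-1 response target.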
## Step 3: Response Playbook Template

````markdown
## Incident Playbook: [Incident Type]

### Detection
- Alert source: [PagerDuty, Datadog, customer report]
- Initial indicators: [what triggers this playbook]

### Triage (First 15 Minutes)
1. Acknowledge alert in PagerDuty
2. Join war room: [Slack channel / Zoom link]
3. Assign roles: IC, Comms Lead, Tech Lead, Scribe
4. Assess severity using classification matrix
5. Start incident document from template

### Diagnosis
1. Check dashboards: [links to relevant dashboards]
2. Check recent deployments: [CI/CD link]
3. Check infrastructure: [CloudWatch / Datadog link]
4. Run diagnostic commands:

```bash
kubectl get pods -n production
kubectl logs -n production -l app=api --tail=100
curl -s https://api.company.com/health | jq .
```

### Remediation
- Option A: Rollback last deployment
  ```bash
  kubectl rollout undo deployment/api -n production
  ```
- Option B: Scale up to handle load
  ```bash
  kubectl scale deployment/api --replicas=10 -n production
  ```
- Option C: Failover to DR
  ```bash
  ./scripts/failover-to-dr.sh
  ```

### Communication Templates

**Internal (Slack #incidents):**

🔴 SEV-[X] Incident: [Title]
Impact: [description]
IC: @[name]
Status: Investigating / Identified / Monitoring / Resolved
Next update: [time]

**External (Status Page):**

We are currently experiencing [issue description]. This impacts [affected services]. Our team is actively working on resolution. We will provide an update by [time].

### Resolution
- Verify service is healthy (dashboard green)
- Monitor for 30 minutes post-fix
- Update status page → Resolved
- Schedule post-mortem within 48 hours
````
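The "monitor for 30 minutes post-fix" step is easy to automate: poll a health probe and declare recovery only after a sustained run of clean checks, so a single lucky response doesn't close the incident early. A sketch, with the probe, check count, and interval as illustrative parameters rather than prescribed values:

```python
import time

def monitor_recovery(probe, checks=30, interval_s=60, required_clean=30):
    """Run `probe()` (returns True when healthy) up to `checks` times,
    `interval_s` seconds apart; declare recovery only after
    `required_clean` consecutive healthy results."""
    clean = 0
    for i in range(checks):
        clean = clean + 1 if probe() else 0  # any failure resets the streak
        if clean >= required_clean:
            return True
        if i < checks - 1:
            time.sleep(interval_s)
    return False
```

With the defaults this is 30 one-minute checks, all of which must pass, matching the resolution checklist above; `probe` could wrap the same `curl`-style health-endpoint call used during diagnosis.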
---
## Step 4: Post-Mortem Framework
```markdown
## Post-Mortem: [Incident Title]
**Date:** [date]
**Duration:** [start – end]
**Severity:** SEV-[X]
**Incident Commander:** [name]
### Summary
[2-3 sentence description of what happened and impact]
### Timeline
| Time (UTC) | Event |
|---|---|
| 14:02 | Alert triggered: API error rate > 5% |
| 14:05 | On-call acknowledged, joined war room |
| 14:12 | Root cause identified: database connection pool exhausted |
| 14:18 | Fix applied: increased pool size from 20 to 100 |
| 14:25 | Service recovered, monitoring |
| 14:55 | Incident resolved |
### Root Cause
[Detailed technical explanation]
### Impact
- [X] customers affected
- [Y] minutes of downtime
- $[Z] estimated revenue impact
### What Went Well
- Alert fired within 2 minutes
- War room assembled in 5 minutes
- Fix applied in 15 minutes
### What Could Be Improved
- Connection pool limits weren't monitored
- Runbook didn't cover this specific scenario
- Customer communication was delayed
### Action Items
| Action | Owner | Due Date |
|---|---|---|
| Add connection pool monitoring | @engineer | [date] |
| Update runbook with DB pool playbook | @sre | [date] |
| Automate customer notification | @platform | [date] |
| Load test with 2x normal traffic | @qa | [date] |
```

## Incident Response Checklist

- [ ] Severity classification defined and documented
- [ ] On-call rotation configured (PagerDuty/Opsgenie)
- [ ] Escalation path defined for each severity
- [ ] War room channel/bridge set up
- [ ] Playbooks written for top 5 incident types
- [ ] Communication templates drafted
- [ ] Post-mortem template standardized
- [ ] Tabletop exercise scheduled quarterly
- [ ] Action items tracked to completion
- [ ] Metrics tracked: MTTA, MTTR, incidents/month
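MTTA and MTTR fall straight out of the timestamps the scribe records. A sketch that computes both from per-incident records, using the times from the example post-mortem timeline above (the date itself is arbitrary):

```python
from datetime import datetime
from statistics import mean

def incident_metrics(incidents):
    """Mean time to acknowledge / resolve, in minutes, across incidents."""
    mtta = mean((i["acknowledged"] - i["triggered"]).total_seconds() / 60
                for i in incidents)
    mttr = mean((i["resolved"] - i["triggered"]).total_seconds() / 60
                for i in incidents)
    return round(mtta, 1), round(mttr, 1)

# Example record built from the post-mortem timeline above
# (14:02 alert, 14:05 acknowledged, 14:55 resolved; date is arbitrary).
incident = {
    "triggered":    datetime(2024, 1, 1, 14, 2),
    "acknowledged": datetime(2024, 1, 1, 14, 5),
    "resolved":     datetime(2024, 1, 1, 14, 55),
}
```

For that single incident the function yields an MTTA of 3 minutes and an MTTR of 53 minutes; tracked per month, the same calculation gives the trend lines the checklist calls for.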
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For SRE consulting, visit garnetgrid.com.
:::