
# How to Build an Effective Incident Response Playbook

Build and test incident response playbooks for your team. Covers severity classification, communication templates, war room procedures, and post-mortem frameworks.

Organizations with a tested incident response plan save an average of $2.66M per breach compared to those without, per IBM's Cost of a Data Breach report. The plan isn't the document; it's the rehearsal.


## Step 1: Severity Classification

| Severity | Definition | Response Time | Example |
|---|---|---|---|
| SEV-1 (Critical) | Revenue-impacting, data breach, full outage | 15 min | Payment system down, data exfiltrated |
| SEV-2 (High) | Partial outage, degraded performance | 30 min | API error rate > 10%, one region down |
| SEV-3 (Medium) | Minor impact, workaround exists | 2 hours | Slow queries, non-critical service down |
| SEV-4 (Low) | No user impact, internal only | Next business day | Monitoring alert, log anomaly |
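The matrix above can be encoded so alerting glue can look up the response-time SLA programmatically; a minimal shell sketch (the `sla_minutes` helper name and the 24-hour approximation of "next business day" are assumptions):

```shell
#!/bin/sh
# Map a severity label to its response-time SLA in minutes, per the
# classification matrix above. Unknown labels fail loudly so callers
# never silently get no SLA.
sla_minutes() {
  case "$1" in
    SEV-1) echo 15 ;;
    SEV-2) echo 30 ;;
    SEV-3) echo 120 ;;
    SEV-4) echo 1440 ;;  # "next business day", approximated as 24 hours
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

sla_minutes SEV-2  # prints 30
```

Paging scripts can then compare time-since-alert against `$(sla_minutes "$sev")` and escalate when the SLA is breached.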

## Step 2: On-Call Structure

```yaml
# PagerDuty schedule structure
on_call_rotation:
  primary:
    schedule: weekly_rotation
    escalation:
      - wait: 5_minutes
        target: secondary_on_call
      - wait: 10_minutes
        target: engineering_manager
      - wait: 15_minutes
        target: vp_engineering

  incident_commander_pool:
    - senior_engineer_1
    - senior_engineer_2
    - engineering_manager_1
    - engineering_manager_2

  roles:
    incident_commander: "Coordinates response, makes decisions"
    communications_lead: "Updates stakeholders, customers"
    technical_lead: "Directs debugging and remediation"
    scribe: "Documents timeline and decisions"
```
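Escalation tiers only work if the waits actually increase; a small sanity check worth running in CI (the `wait: N_minutes` line format follows the example above, and the file name passed in is whatever your repo uses, both assumptions):

```shell
#!/bin/sh
# Verify that escalation waits in a schedule file increase strictly,
# so a later tier can never fire before an earlier one. Expects
# lines like "wait: 5_minutes", as in the rotation config above.
check_escalation() {
  grep -o 'wait: [0-9]*' "$1" | awk '{ print $2 }' | awk '
    NR > 1 && $1 <= prev { print "non-increasing wait at tier " NR; bad = 1 }
    { prev = $1 }
    END { exit bad }'
}
```

Wiring this into CI catches a bad schedule edit before it pages the wrong person at 3 a.m.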

## Step 3: Response Playbook Template

## Incident Playbook: [Incident Type]

### Detection
- Alert source: [PagerDuty, Datadog, customer report]
- Initial indicators: [what triggers this playbook]

### Triage (First 15 Minutes)
1. Acknowledge alert in PagerDuty
2. Join war room: [Slack channel / Zoom link]
3. Assign roles: IC, Comms Lead, Tech Lead, Scribe
4. Assess severity using classification matrix
5. Start incident document from template

### Diagnosis
1. Check dashboards: [links to relevant dashboards]
2. Check recent deployments: [CI/CD link]
3. Check infrastructure: [CloudWatch / Datadog link]
4. Run diagnostic commands:
   ```bash
   kubectl get pods -n production
   kubectl logs -n production -l app=api --tail=100
   curl -s https://api.company.com/health | jq .
   ```

### Remediation

- **Option A: Roll back the last deployment**

  ```bash
  kubectl rollout undo deployment/api -n production
  ```

- **Option B: Scale up to handle load**

  ```bash
  kubectl scale deployment/api --replicas=10 -n production
  ```

- **Option C: Fail over to DR**

  ```bash
  ./scripts/failover-to-dr.sh
  ```

### Communication Templates

**Internal (Slack #incidents):**

```
🔴 SEV-[X] Incident: [Title]
Impact: [description]
IC: @[name]
Status: Investigating / Identified / Monitoring / Resolved
Next update: [time]
```

**External (Status Page):**

> We are currently experiencing [issue description]. This impacts [affected services]. Our team is actively working on a resolution. We will provide an update by [time].

### Resolution

1. Verify the service is healthy (dashboards green)
2. Monitor for 30 minutes post-fix
3. Update the status page to Resolved
4. Schedule a post-mortem within 48 hours
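The "monitor for 30 minutes" step can be automated so nobody has to stare at a dashboard; a sketch assuming a plain HTTP health endpoint (the URL and one-minute cadence are placeholders to adapt):

```shell
#!/bin/sh
# Poll a health endpoint once a minute for a soak period, failing on
# the first non-200 response. A clean exit means the fix has held
# for the whole window.
soak_check() {
  url="$1"; checks="$2"; i=0
  while [ "$i" -lt "$checks" ]; do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
    if [ "$code" != "200" ]; then
      echo "health check failed (HTTP $code) at check $((i + 1))" >&2
      return 1
    fi
    i=$((i + 1))
    sleep 60
  done
  echo "healthy for $checks consecutive checks"
}

# soak_check https://api.company.com/health 30
```

Running this from the war room gives the IC an unambiguous signal for moving the status from Monitoring to Resolved.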

---

## Step 4: Post-Mortem Framework

```markdown
## Post-Mortem: [Incident Title]

**Date:** [date]
**Duration:** [start – end]
**Severity:** SEV-[X]
**Incident Commander:** [name]

### Summary
[2-3 sentence description of what happened and impact]

### Timeline
| Time (UTC) | Event |
|---|---|
| 14:02 | Alert triggered: API error rate > 5% |
| 14:05 | On-call acknowledged, joined war room |
| 14:12 | Root cause identified: database connection pool exhausted |
| 14:18 | Fix applied: increased pool size from 20 to 100 |
| 14:25 | Service recovered, monitoring |
| 14:55 | Incident resolved |

### Root Cause
[Detailed technical explanation]

### Impact
- [X] customers affected
- [Y] minutes of downtime
- $[Z] estimated revenue impact

### What Went Well
- Alert fired within 2 minutes
- War room assembled in 5 minutes
- Fix applied in 15 minutes

### What Could Be Improved
- Connection pool limits weren't monitored
- Runbook didn't cover this specific scenario
- Customer communication was delayed

### Action Items
| Action | Owner | Due Date |
|---|---|---|
| Add connection pool monitoring | @engineer | [date] |
| Update runbook with DB pool playbook | @sre | [date] |
| Automate customer notification | @platform | [date] |
| Load test with 2x normal traffic | @qa | [date] |
```

## Incident Response Checklist

- [ ] Severity classification defined and documented
- [ ] On-call rotation configured (PagerDuty/Opsgenie)
- [ ] Escalation path defined for each severity
- [ ] War room channel/bridge set up
- [ ] Playbooks written for the top 5 incident types
- [ ] Communication templates drafted
- [ ] Post-mortem template standardized
- [ ] Tabletop exercise scheduled quarterly
- [ ] Action items tracked to completion
- [ ] Metrics tracked: MTTA, MTTR, incidents/month
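The last checklist item needs the arithmetic nailed down; one way to compute MTTR from exported timestamps (the two-column CSV of epoch seconds is an assumed format; adapt it to your incident tracker's export):

```shell
#!/bin/sh
# Mean time to resolve, in minutes, from a CSV of
# "detected_epoch,resolved_epoch" rows (one incident per line).
mttr_minutes() {
  awk -F, '
    { total += ($2 - $1) / 60; n++ }
    END { if (n) printf "%.1f\n", total / n }' "$1"
}
```

MTTA works the same way with acknowledged timestamps in place of resolved ones; tracking both per month shows whether playbook rehearsals are actually paying off.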

:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For SRE consulting, visit garnetgrid.com.
:::