
# How to Build an Effective Incident Response Playbook

Build and test incident response playbooks for your team. Covers severity classification, communication templates, war room procedures, and post-mortem frameworks.

Organizations with a tested incident response plan save an average of $2.66M per breach compared to those without, per IBM's Cost of a Data Breach report. The plan isn't the document; it's the rehearsal.


## Step 1: Severity Classification

| Severity | Definition | Response Time | Example |
|---|---|---|---|
| SEV-1 (Critical) | Revenue-impacting, data breach, full outage | 15 min | Payment system down, data exfiltrated |
| SEV-2 (High) | Partial outage, degraded performance | 30 min | API error rate > 10%, one region down |
| SEV-3 (Medium) | Minor impact, workaround exists | 2 hours | Slow queries, non-critical service down |
| SEV-4 (Low) | No user impact, internal only | Next business day | Monitoring alert, log anomaly |
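The matrix above can be encoded so alerting glue can look up the response-time SLA programmatically; a minimal shell sketch (the `sla_minutes` helper name and the 24-hour approximation of "next business day" are assumptions):

```shell
#!/bin/sh
# Map a severity label to its response-time SLA in minutes, per the
# classification matrix above. Unknown labels fail loudly so callers
# never silently get no SLA.
sla_minutes() {
  case "$1" in
    SEV-1) echo 15 ;;
    SEV-2) echo 30 ;;
    SEV-3) echo 120 ;;
    SEV-4) echo 1440 ;;  # "next business day", approximated as 24 hours
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

sla_minutes SEV-2  # prints 30
```

Paging scripts can then compare time-since-alert against `$(sla_minutes "$sev")` and escalate when the SLA is breached.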

## Step 2: On-Call Structure

```yaml
# PagerDuty schedule structure
on_call_rotation:
  primary:
    schedule: weekly_rotation
    escalation:
      - wait: 5_minutes
        target: secondary_on_call
      - wait: 10_minutes
        target: engineering_manager
      - wait: 15_minutes
        target: vp_engineering

  incident_commander_pool:
    - senior_engineer_1
    - senior_engineer_2
    - engineering_manager_1
    - engineering_manager_2

  roles:
    incident_commander: "Coordinates response, makes decisions"
    communications_lead: "Updates stakeholders, customers"
    technical_lead: "Directs debugging and remediation"
    scribe: "Documents timeline and decisions"
```
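Escalation tiers only work if the waits actually increase; a small sanity check worth running in CI (the `wait: N_minutes` line format follows the example above, and the file name passed in is whatever your repo uses, both assumptions):

```shell
#!/bin/sh
# Verify that escalation waits in a schedule file increase strictly,
# so a later tier can never fire before an earlier one. Expects
# lines like "wait: 5_minutes", as in the rotation config above.
check_escalation() {
  grep -o 'wait: [0-9]*' "$1" | awk '{ print $2 }' | awk '
    NR > 1 && $1 <= prev { print "non-increasing wait at tier " NR; bad = 1 }
    { prev = $1 }
    END { exit bad }'
}
```

Wiring this into CI catches a bad schedule edit before it pages the wrong person at 3 a.m.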

## Step 3: Response Playbook Template

## Incident Playbook: [Incident Type]

### Detection
- Alert source: [PagerDuty, Datadog, customer report]
- Initial indicators: [what triggers this playbook]

### Triage (First 15 Minutes)
1. Acknowledge alert in PagerDuty
2. Join war room: [Slack channel / Zoom link]
3. Assign roles: IC, Comms Lead, Tech Lead, Scribe
4. Assess severity using classification matrix
5. Start incident document from template

### Diagnosis
1. Check dashboards: [links to relevant dashboards]
2. Check recent deployments: [CI/CD link]
3. Check infrastructure: [CloudWatch / Datadog link]
4. Run diagnostic commands:
   ```bash
   kubectl get pods -n production
   kubectl logs -n production -l app=api --tail=100
   curl -s https://api.company.com/health | jq .
   ```

### Remediation

- **Option A: Roll back the last deployment**

  ```bash
  kubectl rollout undo deployment/api -n production
  ```

- **Option B: Scale up to handle load**

  ```bash
  kubectl scale deployment/api --replicas=10 -n production
  ```

- **Option C: Fail over to DR**

  ```bash
  ./scripts/failover-to-dr.sh
  ```

### Communication Templates

**Internal (Slack #incidents):**

```
🔴 SEV-[X] Incident: [Title]
Impact: [description]
IC: @[name]
Status: Investigating / Identified / Monitoring / Resolved
Next update: [time]
```

**External (Status Page):**

> We are currently experiencing [issue description]. This impacts [affected services]. Our team is actively working on a resolution. We will provide an update by [time].

### Resolution

1. Verify the service is healthy (dashboards green)
2. Monitor for 30 minutes post-fix
3. Update the status page to Resolved
4. Schedule a post-mortem within 48 hours
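The "monitor for 30 minutes" step can be automated so nobody has to stare at a dashboard; a sketch assuming a plain HTTP health endpoint (the URL and one-minute cadence are placeholders to adapt):

```shell
#!/bin/sh
# Poll a health endpoint once a minute for a soak period, failing on
# the first non-200 response. A clean exit means the fix has held
# for the whole window.
soak_check() {
  url="$1"; checks="$2"; i=0
  while [ "$i" -lt "$checks" ]; do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
    if [ "$code" != "200" ]; then
      echo "health check failed (HTTP $code) at check $((i + 1))" >&2
      return 1
    fi
    i=$((i + 1))
    sleep 60
  done
  echo "healthy for $checks consecutive checks"
}

# soak_check https://api.company.com/health 30
```

Running this from the war room gives the IC an unambiguous signal for moving the status from Monitoring to Resolved.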

---

## Step 4: Post-Mortem Framework

```markdown
## Post-Mortem: [Incident Title]

**Date:** [date]
**Duration:** [start – end]
**Severity:** SEV-[X]
**Incident Commander:** [name]

### Summary
[2-3 sentence description of what happened and impact]

### Timeline
| Time (UTC) | Event |
|---|---|
| 14:02 | Alert triggered: API error rate > 5% |
| 14:05 | On-call acknowledged, joined war room |
| 14:12 | Root cause identified: database connection pool exhausted |
| 14:18 | Fix applied: increased pool size from 20 to 100 |
| 14:25 | Service recovered, monitoring |
| 14:55 | Incident resolved |

### Root Cause
[Detailed technical explanation]

### Impact
- [X] customers affected
- [Y] minutes of downtime
- $[Z] estimated revenue impact

### What Went Well
- Alert fired within 2 minutes
- War room assembled in 5 minutes
- Fix applied in 15 minutes

### What Could Be Improved
- Connection pool limits weren't monitored
- Runbook didn't cover this specific scenario
- Customer communication was delayed

### Action Items
| Action | Owner | Due Date |
|---|---|---|
| Add connection pool monitoring | @engineer | [date] |
| Update runbook with DB pool playbook | @sre | [date] |
| Automate customer notification | @platform | [date] |
| Load test with 2x normal traffic | @qa | [date] |
```

## Incident Response Checklist

- [ ] Severity classification defined and documented
- [ ] On-call rotation configured (PagerDuty/Opsgenie)
- [ ] Escalation path defined for each severity
- [ ] War room channel/bridge set up
- [ ] Playbooks written for the top 5 incident types
- [ ] Communication templates drafted
- [ ] Post-mortem template standardized
- [ ] Tabletop exercise scheduled quarterly
- [ ] Action items tracked to completion
- [ ] Metrics tracked: MTTA, MTTR, incidents/month
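The last checklist item needs the arithmetic nailed down; one way to compute MTTR from exported timestamps (the two-column CSV of epoch seconds is an assumed format; adapt it to your incident tracker's export):

```shell
#!/bin/sh
# Mean time to resolve, in minutes, from a CSV of
# "detected_epoch,resolved_epoch" rows (one incident per line).
mttr_minutes() {
  awk -F, '
    { total += ($2 - $1) / 60; n++ }
    END { if (n) printf "%.1f\n", total / n }' "$1"
}
```

MTTA works the same way with acknowledged timestamps in place of resolved ones; tracking both per month shows whether playbook rehearsals are actually paying off.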

:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For SRE consulting, visit garnetgrid.com.
:::