# How to Build a Disaster Recovery Plan
Design and test disaster recovery for cloud and on-prem workloads. Covers RTO/RPO targets, backup strategies, failover automation, and tabletop exercises.
A disaster recovery plan that hasn’t been tested isn’t a plan — it’s a theory. 37% of organizations fail their first real DR activation because they never practiced.
## Step 1: Define RTO and RPO Targets
| Term | Definition | Business Question |
|---|---|---|
| RTO (Recovery Time Objective) | Max acceptable downtime | "How long can we be down?" |
| RPO (Recovery Point Objective) | Max acceptable data loss | "How much data can we lose?" |
### Tiers by Business Criticality
| Tier | Systems | RTO | RPO | Strategy |
|---|---|---|---|---|
| Tier 1: Mission Critical | Payment processing, auth, core API | < 15 min | 0 (zero data loss) | Active-Active / Hot Standby |
| Tier 2: Business Critical | CRM, ERP, dashboards | < 1 hour | < 15 min | Warm Standby |
| Tier 3: Essential | Email, file shares, internal tools | < 4 hours | < 1 hour | Pilot Light |
| Tier 4: Non-Critical | Dev environments, archives | < 24 hours | < 24 hours | Backup & Restore |
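Once targets are set, encode them so a DR test can fail loudly instead of leaving pass/fail to judgment. A minimal sketch — the `check_rto` helper and the measured times below are hypothetical, not part of any AWS tooling:

```bash
# Hypothetical helper: compare a measured recovery time (minutes) against an RTO target.
check_rto() {
  local measured_min=$1 target_min=$2
  if [ "$measured_min" -lt "$target_min" ]; then
    echo "PASS: recovered in ${measured_min} min (RTO ${target_min} min)"
  else
    echo "FAIL: recovered in ${measured_min} min, exceeds RTO of ${target_min} min"
  fi
}

check_rto 12 15   # Tier 1: < 15 min
check_rto 95 60   # Tier 2: < 60 min
```

Feeding measured (not estimated) recovery times into a check like this is what turns the tier table into something enforceable.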
## Step 2: Implement Backup Strategy
```bash
# AWS — automated RDS snapshots
aws rds modify-db-instance \
  --db-instance-identifier production-db \
  --backup-retention-period 30 \
  --preferred-backup-window "03:00-04:00"

# Cross-region copy for DR (run in the destination region)
aws rds copy-db-snapshot \
  --source-db-snapshot-arn arn:aws:rds:us-east-1:123456789:snapshot:prod-snap \
  --target-db-snapshot-identifier prod-snap-dr \
  --source-region us-east-1 \
  --region us-west-2

# S3 cross-region replication (versioning must be enabled on both buckets)
aws s3api put-bucket-replication \
  --bucket production-data \
  --replication-configuration '{
    "Role": "arn:aws:iam::123456789:role/s3-replication-role",
    "Rules": [{
      "Status": "Enabled",
      "Prefix": "",
      "Destination": {
        "Bucket": "arn:aws:s3:::production-data-dr-west",
        "StorageClass": "STANDARD_IA"
      }
    }]
  }'
```
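Backups only meet an RPO while they stay fresh, so it is worth checking snapshot age continuously rather than assuming the schedule held. A sketch of an age check against the 15-minute Tier 2 RPO — the timestamps are hard-coded for illustration, and the check assumes GNU `date`; in practice `SNAPSHOT_TIME` would be queried and `NOW` taken from the clock:

```bash
# Check that the newest cross-region snapshot satisfies a 15-minute RPO.
# Hypothetical timestamps; in practice, query the latest snapshot time with:
#   aws rds describe-db-snapshots --region us-west-2 \
#     --query 'max_by(DBSnapshots,&SnapshotCreateTime).SnapshotCreateTime' --output text
SNAPSHOT_TIME="2024-01-15T03:10:00Z"
NOW="2024-01-15T03:20:00Z"           # normally $(date -u +%Y-%m-%dT%H:%M:%SZ)
RPO_SEC=$((15 * 60))

# GNU date parses the ISO timestamps into epoch seconds for the subtraction
age_sec=$(( $(date -d "$NOW" +%s) - $(date -d "$SNAPSHOT_TIME" +%s) ))
if [ "$age_sec" -le "$RPO_SEC" ]; then
  echo "RPO OK: latest snapshot is $((age_sec / 60)) min old"
else
  echo "RPO BREACH: latest snapshot is $((age_sec / 60)) min old"
fi
```

Wiring this into monitoring means an RPO breach pages someone before a disaster exposes it.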
## Step 3: Automate Failover
```yaml
# Kubernetes — multi-region failover with external-dns
# Route 53 health check + failover routing
apiVersion: externaldns.k8s.io/v1alpha1
kind: DNSEndpoint
metadata:
  name: api-failover
spec:
  endpoints:
    - dnsName: api.yourcompany.com
      recordType: A
      targets:
        - 10.1.0.50  # Primary (us-east-1)
      setIdentifier: primary
      providerSpecific:
        - name: aws/failover
          value: PRIMARY
        - name: aws/health-check-id
          value: "abc-123-health-check"
    - dnsName: api.yourcompany.com
      recordType: A
      targets:
        - 10.2.0.50  # Secondary (us-west-2)
      setIdentifier: secondary
      providerSpecific:
        - name: aws/failover
          value: SECONDARY
```
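The `aws/health-check-id` value must reference a Route 53 health check that already exists. A sketch of creating one with the AWS CLI, reusing the primary endpoint from the example above — note that Route 53 health checkers probe from the public internet, so in practice the target must be publicly reachable (the private `10.x` addresses here would need a public endpoint or load balancer in front):

```bash
# Create the Route 53 health check referenced by aws/health-check-id above.
# IP, port, and path are illustrative values from this example.
aws route53 create-health-check \
  --caller-reference "api-primary-$(date +%s)" \
  --health-check-config '{
    "IPAddress": "10.1.0.50",
    "Port": 443,
    "Type": "HTTPS",
    "ResourcePath": "/health",
    "RequestInterval": 30,
    "FailureThreshold": 3
  }'
```

With `RequestInterval: 30` and `FailureThreshold: 3`, Route 53 marks the primary unhealthy roughly 90 seconds after it stops responding, at which point failover routing answers with the secondary record.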
## Step 4: Test Your DR Plan
### Tabletop Exercise (Quarterly)
```markdown
## DR Tabletop Exercise Template

**Scenario:** Primary database server is corrupted at 2 AM Saturday.

**Walk through these questions:**
1. Who receives the first alert? (pager, Slack, email?)
2. What's the escalation path? (on-call → lead → VP)
3. How do we confirm it's a real disaster vs a monitoring false positive?
4. What's the failover command? Who has access to run it?
5. How long does failover take? (measured, not estimated)
6. How do we verify the DR environment is serving traffic correctly?
7. What data was lost between last backup and failure?
8. How do we communicate to customers? Template ready?
9. How do we failback to primary after recovery?
10. What's our post-mortem process?
```
### Technical DR Test (Semi-Annual)
```bash
#!/bin/bash
# Automated DR test script
set -euo pipefail

echo "=== DR Test Started ==="
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# 1. Restore database backup in DR region
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier "dr-test-${TIMESTAMP}" \
  --db-snapshot-identifier "latest-cross-region-snap" \
  --region us-west-2

# 2. Wait for restoration to complete
aws rds wait db-instance-available \
  --db-instance-identifier "dr-test-${TIMESTAMP}" \
  --region us-west-2

# 3. Validate data integrity against the restored instance's actual endpoint
DR_ENDPOINT=$(aws rds describe-db-instances \
  --db-instance-identifier "dr-test-${TIMESTAMP}" \
  --region us-west-2 \
  --query 'DBInstances[0].Endpoint.Address' --output text)
psql -h "${DR_ENDPOINT}" -c "SELECT COUNT(*) FROM orders WHERE order_date > NOW() - INTERVAL '1 day';"

# 4. Run smoke tests against DR environment
curl -f https://dr-api.yourcompany.com/health || echo "FAIL: API health check"

# 5. Cleanup — delete the test instance so it doesn't accrue cost
aws rds delete-db-instance \
  --db-instance-identifier "dr-test-${TIMESTAMP}" \
  --skip-final-snapshot \
  --region us-west-2

echo "=== DR Test Complete ==="
```
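A semi-annual cadence slips unless the test is scheduled rather than remembered. A hypothetical crontab entry (the script path and log location are placeholders, not anything defined above):

```bash
# crontab entry: run the DR test at 02:00 on the 1st of January and July
0 2 1 1,7 * /opt/dr/dr-test.sh >> /var/log/dr-test.log 2>&1
```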
## DR Cost Estimation
| Strategy | Monthly Cost (relative to production) | RTO | RPO |
|---|---|---|---|
| Backup & Restore | 5-10% | 4-24 hours | Hours |
| Pilot Light | 10-20% | 1-4 hours | Minutes |
| Warm Standby | 30-50% | 15-60 min | Minutes |
| Active-Active | 100%+ | Near-zero | Zero |
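To turn the table's percentages into a budget line, multiply them by production spend. A quick sketch using a hypothetical $20,000/month production bill and the Warm Standby range from the table:

```bash
# Hypothetical production spend; 30-50% is the Warm Standby range from the table above.
PROD_MONTHLY=20000
LOW=$((PROD_MONTHLY * 30 / 100))
HIGH=$((PROD_MONTHLY * 50 / 100))
echo "Warm Standby DR budget: \$${LOW}-\$${HIGH} per month"
```

Running the same arithmetic per tier makes the cost/RTO trade-off concrete enough to put in front of leadership.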
## DR Checklist
- RTO/RPO defined for every Tier 1 & 2 system
- Automated backups configured with cross-region replication
- Database backup restoration tested monthly
- DNS failover configured and tested
- DR runbook documented with step-by-step commands
- On-call rotation includes DR responsibilities
- Tabletop exercise completed quarterly
- Full DR test completed semi-annually
- Customer communication templates prepared
- Post-mortem process defined for DR activations
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For DR assessment consulting, visit garnetgrid.com.
:::