# How to Build a Disaster Recovery Plan
Design and test disaster recovery for cloud and on-prem workloads. Covers RTO/RPO targets, backup strategies, failover automation, and tabletop exercises.
A disaster recovery plan that hasn’t been tested isn’t a plan — it’s a theory. 37% of organizations fail their first real DR activation because they never practiced.
## Step 1: Define RTO and RPO Targets
| Term | Definition | Business Question |
|---|---|---|
| RTO (Recovery Time Objective) | Max acceptable downtime | "How long can we be down?" |
| RPO (Recovery Point Objective) | Max acceptable data loss | "How much data can we lose?" |
### Tiers by Business Criticality
| Tier | Systems | RTO | RPO | Strategy |
|---|---|---|---|---|
| Tier 1: Mission Critical | Payment processing, auth, core API | < 15 min | 0 (zero data loss) | Active-Active / Hot Standby |
| Tier 2: Business Critical | CRM, ERP, dashboards | < 1 hour | < 15 min | Warm Standby |
| Tier 3: Essential | Email, file shares, internal tools | < 4 hours | < 1 hour | Pilot Light |
| Tier 4: Non-Critical | Dev environments, archives | < 24 hours | < 24 hours | Backup & Restore |
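Once targets are set, encode them so a DR test can fail loudly instead of leaving pass/fail to judgment. A minimal sketch — the `check_rto` helper and the measured times below are hypothetical, not part of any AWS tooling:

```bash
# Hypothetical helper: compare a measured recovery time (minutes) against an RTO target.
check_rto() {
  local measured_min=$1 target_min=$2
  if [ "$measured_min" -lt "$target_min" ]; then
    echo "PASS: recovered in ${measured_min} min (RTO ${target_min} min)"
  else
    echo "FAIL: recovered in ${measured_min} min, exceeds RTO of ${target_min} min"
  fi
}

check_rto 12 15   # Tier 1: < 15 min
check_rto 95 60   # Tier 2: < 60 min
```

Feeding measured (not estimated) recovery times into a check like this is what turns the tier table into something enforceable.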
## Step 2: Implement Backup Strategy
```bash
# AWS — automated RDS snapshots
aws rds modify-db-instance \
  --db-instance-identifier production-db \
  --backup-retention-period 30 \
  --preferred-backup-window "03:00-04:00"

# Cross-region copy for DR (run in the destination region)
aws rds copy-db-snapshot \
  --source-db-snapshot-arn arn:aws:rds:us-east-1:123456789:snapshot:prod-snap \
  --target-db-snapshot-identifier prod-snap-dr \
  --source-region us-east-1 \
  --region us-west-2

# S3 cross-region replication (versioning must be enabled on both buckets)
aws s3api put-bucket-replication \
  --bucket production-data \
  --replication-configuration '{
    "Role": "arn:aws:iam::123456789:role/s3-replication-role",
    "Rules": [{
      "Status": "Enabled",
      "Prefix": "",
      "Destination": {
        "Bucket": "arn:aws:s3:::production-data-dr-west",
        "StorageClass": "STANDARD_IA"
      }
    }]
  }'
```
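Backups only meet an RPO while they stay fresh, so it is worth checking snapshot age continuously rather than assuming the schedule held. A sketch of an age check against the 15-minute Tier 2 RPO — the timestamps are hard-coded for illustration, and the check assumes GNU `date`; in practice `SNAPSHOT_TIME` would be queried and `NOW` taken from the clock:

```bash
# Check that the newest cross-region snapshot satisfies a 15-minute RPO.
# Hypothetical timestamps; in practice, query the latest snapshot time with:
#   aws rds describe-db-snapshots --region us-west-2 \
#     --query 'max_by(DBSnapshots,&SnapshotCreateTime).SnapshotCreateTime' --output text
SNAPSHOT_TIME="2024-01-15T03:10:00Z"
NOW="2024-01-15T03:20:00Z"           # normally $(date -u +%Y-%m-%dT%H:%M:%SZ)
RPO_SEC=$((15 * 60))

# GNU date parses the ISO timestamps into epoch seconds for the subtraction
age_sec=$(( $(date -d "$NOW" +%s) - $(date -d "$SNAPSHOT_TIME" +%s) ))
if [ "$age_sec" -le "$RPO_SEC" ]; then
  echo "RPO OK: latest snapshot is $((age_sec / 60)) min old"
else
  echo "RPO BREACH: latest snapshot is $((age_sec / 60)) min old"
fi
```

Wiring this into monitoring means an RPO breach pages someone before a disaster exposes it.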
## Step 3: Automate Failover
```yaml
# Kubernetes — multi-region failover with external-dns
# Route 53 health check + failover routing
apiVersion: externaldns.k8s.io/v1alpha1
kind: DNSEndpoint
metadata:
  name: api-failover
spec:
  endpoints:
    - dnsName: api.yourcompany.com
      recordType: A
      targets:
        - 10.1.0.50  # Primary (us-east-1)
      setIdentifier: primary
      providerSpecific:
        - name: aws/failover
          value: PRIMARY
        - name: aws/health-check-id
          value: "abc-123-health-check"
    - dnsName: api.yourcompany.com
      recordType: A
      targets:
        - 10.2.0.50  # Secondary (us-west-2)
      setIdentifier: secondary
      providerSpecific:
        - name: aws/failover
          value: SECONDARY
```
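The `aws/health-check-id` value must reference a Route 53 health check that already exists. A sketch of creating one with the AWS CLI, reusing the primary endpoint from the example above — note that Route 53 health checkers probe from the public internet, so in practice the target must be publicly reachable (the private `10.x` addresses here would need a public endpoint or load balancer in front):

```bash
# Create the Route 53 health check referenced by aws/health-check-id above.
# IP, port, and path are illustrative values from this example.
aws route53 create-health-check \
  --caller-reference "api-primary-$(date +%s)" \
  --health-check-config '{
    "IPAddress": "10.1.0.50",
    "Port": 443,
    "Type": "HTTPS",
    "ResourcePath": "/health",
    "RequestInterval": 30,
    "FailureThreshold": 3
  }'
```

With `RequestInterval: 30` and `FailureThreshold: 3`, Route 53 marks the primary unhealthy roughly 90 seconds after it stops responding, at which point failover routing answers with the secondary record.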
## Step 4: Test Your DR Plan
### Tabletop Exercise (Quarterly)
```markdown
## DR Tabletop Exercise Template

**Scenario:** Primary database server is corrupted at 2 AM Saturday.

**Walk through these questions:**
1. Who receives the first alert? (pager, Slack, email?)
2. What's the escalation path? (on-call → lead → VP)
3. How do we confirm it's a real disaster vs a monitoring false positive?
4. What's the failover command? Who has access to run it?
5. How long does failover take? (measured, not estimated)
6. How do we verify the DR environment is serving traffic correctly?
7. What data was lost between last backup and failure?
8. How do we communicate to customers? Template ready?
9. How do we failback to primary after recovery?
10. What's our post-mortem process?
```
### Technical DR Test (Semi-Annual)
```bash
#!/bin/bash
# Automated DR test script
set -euo pipefail

echo "=== DR Test Started ==="
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# 1. Restore database backup in DR region
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier "dr-test-${TIMESTAMP}" \
  --db-snapshot-identifier "latest-cross-region-snap" \
  --region us-west-2

# 2. Wait for restoration to complete
aws rds wait db-instance-available \
  --db-instance-identifier "dr-test-${TIMESTAMP}" \
  --region us-west-2

# 3. Validate data integrity against the restored instance's actual endpoint
DR_ENDPOINT=$(aws rds describe-db-instances \
  --db-instance-identifier "dr-test-${TIMESTAMP}" \
  --region us-west-2 \
  --query 'DBInstances[0].Endpoint.Address' --output text)
psql -h "${DR_ENDPOINT}" -c "SELECT COUNT(*) FROM orders WHERE order_date > NOW() - INTERVAL '1 day';"

# 4. Run smoke tests against DR environment
curl -f https://dr-api.yourcompany.com/health || echo "FAIL: API health check"

# 5. Cleanup — delete the test instance so it doesn't accrue cost
aws rds delete-db-instance \
  --db-instance-identifier "dr-test-${TIMESTAMP}" \
  --skip-final-snapshot \
  --region us-west-2

echo "=== DR Test Complete ==="
```
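A semi-annual cadence slips unless the test is scheduled rather than remembered. A hypothetical crontab entry (the script path and log location are placeholders, not anything defined above):

```bash
# crontab entry: run the DR test at 02:00 on the 1st of January and July
0 2 1 1,7 * /opt/dr/dr-test.sh >> /var/log/dr-test.log 2>&1
```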
## DR Cost Estimation
| Strategy | Monthly Cost (relative to production) | RTO | RPO |
|---|---|---|---|
| Backup & Restore | 5-10% | 4-24 hours | Hours |
| Pilot Light | 10-20% | 1-4 hours | Minutes |
| Warm Standby | 30-50% | 15-60 min | Minutes |
| Active-Active | 100%+ | Near-zero | Zero |
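To turn the table's percentages into a budget line, multiply them by production spend. A quick sketch using a hypothetical $20,000/month production bill and the Warm Standby range from the table:

```bash
# Hypothetical production spend; 30-50% is the Warm Standby range from the table above.
PROD_MONTHLY=20000
LOW=$((PROD_MONTHLY * 30 / 100))
HIGH=$((PROD_MONTHLY * 50 / 100))
echo "Warm Standby DR budget: \$${LOW}-\$${HIGH} per month"
```

Running the same arithmetic per tier makes the cost/RTO trade-off concrete enough to put in front of leadership.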
## DR Checklist
- RTO/RPO defined for every Tier 1 & 2 system
- Automated backups configured with cross-region replication
- Database backup restoration tested monthly
- DNS failover configured and tested
- DR runbook documented with step-by-step commands
- On-call rotation includes DR responsibilities
- Tabletop exercise completed quarterly
- Full DR test completed semi-annually
- Customer communication templates prepared
- Post-mortem process defined for DR activations
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For DR assessment consulting, visit garnetgrid.com.
:::