Verified by Garnet Grid

How to Build a Disaster Recovery Plan

Design and test disaster recovery for cloud and on-prem workloads. Covers RTO/RPO targets, backup strategies, failover automation, and tabletop exercises.

A disaster recovery plan that hasn’t been tested isn’t a plan — it’s a theory. 37% of organizations fail their first real DR activation because they never practiced.


Step 1: Define RTO and RPO Targets

| Term | Definition | Business Question |
|---|---|---|
| RTO (Recovery Time Objective) | Max acceptable downtime | "How long can we be down?" |
| RPO (Recovery Point Objective) | Max acceptable data loss | "How much data can we lose?" |

Tiers by Business Criticality

| Tier | Systems | RTO | RPO | Strategy |
|---|---|---|---|---|
| Tier 1: Mission Critical | Payment processing, auth, core API | < 15 min | 0 (zero data loss) | Active-Active / Hot Standby |
| Tier 2: Business Critical | CRM, ERP, dashboards | < 1 hour | < 15 min | Warm Standby |
| Tier 3: Essential | Email, file shares, internal tools | < 4 hours | < 1 hour | Pilot Light |
| Tier 4: Non-Critical | Dev environments, archives | < 24 hours | < 24 hours | Backup & Restore |
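
The tier table can be read as a lookup: given how much downtime and data loss a system tolerates, pick the cheapest tier that still meets both limits. The sketch below is one possible reading of those thresholds, not a definitive rule. The function names (`tier_from_rto`, `tier_from_rpo`, `dr_tier`, `dr_strategy`) are hypothetical, and inputs are whole minutes.

```shell
#!/usr/bin/env bash
# Sketch: map tolerated downtime and data loss (in minutes) to a DR tier.
# The cut-offs are an interpretation of the tier table; adjust them to your own tiers.

# tier_from_rto TOLERATED_MIN -> cheapest tier whose RTO bound fits the tolerance
tier_from_rto() {
  local tolerated=$1
  if   (( tolerated >= 1440 )); then echo 4   # Tier 4 recovers in < 24 hours
  elif (( tolerated >= 240 ));  then echo 3   # Tier 3 recovers in < 4 hours
  elif (( tolerated >= 60 ));   then echo 2   # Tier 2 recovers in < 1 hour
  else echo 1                                 # anything tighter needs Tier 1
  fi
}

# tier_from_rpo TOLERATED_MIN -> cheapest tier whose RPO bound fits the tolerance
tier_from_rpo() {
  local tolerated=$1
  if   (( tolerated >= 1440 )); then echo 4   # Tier 4 loses < 24 hours of data
  elif (( tolerated >= 60 ));   then echo 3   # Tier 3 loses < 1 hour
  elif (( tolerated >= 15 ));   then echo 2   # Tier 2 loses < 15 minutes
  else echo 1                                 # Tier 1 targets zero data loss
  fi
}

# dr_tier RTO_MIN RPO_MIN -> the stricter (lower-numbered) of the two tiers
dr_tier() {
  local by_rto by_rpo
  by_rto=$(tier_from_rto "$1")
  by_rpo=$(tier_from_rpo "$2")
  echo $(( by_rto < by_rpo ? by_rto : by_rpo ))
}

# dr_strategy RTO_MIN RPO_MIN -> strategy name from the tier table
dr_strategy() {
  case "$(dr_tier "$1" "$2")" in
    1) echo "Active-Active / Hot Standby" ;;
    2) echo "Warm Standby" ;;
    3) echo "Pilot Light" ;;
    4) echo "Backup & Restore" ;;
  esac
}
```

For example, `dr_strategy 60 15` prints `Warm Standby`, while `dr_strategy 240 5` prints `Active-Active / Hot Standby` because the 5-minute data loss tolerance is stricter than the 4-hour downtime tolerance.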

Step 2: Implement Backup Strategy

# AWS — automated RDS snapshots
aws rds modify-db-instance \
  --db-instance-identifier production-db \
  --backup-retention-period 30 \
  --preferred-backup-window "03:00-04:00"

# Cross-region copy for DR
# (encrypted snapshots also need --kms-key-id for a key in the target region)
aws rds copy-db-snapshot \
  --source-db-snapshot-identifier arn:aws:rds:us-east-1:123456789012:snapshot:prod-snap \
  --target-db-snapshot-identifier prod-snap-dr \
  --source-region us-east-1 \
  --region us-west-2

# S3 cross-region replication (versioning must be enabled on both buckets)
aws s3api put-bucket-replication \
  --bucket production-data \
  --replication-configuration '{
    "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
    "Rules": [{
      "Status": "Enabled",
      "Priority": 1,
      "Filter": {"Prefix": ""},
      "DeleteMarkerReplication": {"Status": "Disabled"},
      "Destination": {
        "Bucket": "arn:aws:s3:::production-data-dr-west",
        "StorageClass": "STANDARD_IA"
      }
    }]
  }'
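
Backups only protect the RPO if the newest one is recent enough, so it can help to check snapshot age on a schedule. The sketch below assumes GNU `date` (for `date -d`) and a configured AWS CLI. The function names and the idea of comparing snapshot age against an RPO in minutes are illustrative assumptions, not part of any AWS tooling.

```shell
#!/usr/bin/env bash
# Sketch: verify that the newest RDS snapshot is younger than the RPO.

# minutes_between START_EPOCH END_EPOCH -> whole minutes between two epoch timestamps
minutes_between() {
  echo $(( ($2 - $1) / 60 ))
}

# within_rpo AGE_MIN RPO_MIN -> exit status 0 if the snapshot age is within the RPO
within_rpo() {
  (( $1 <= $2 ))
}

# latest_snapshot_time INSTANCE_ID REGION -> creation time of the newest available snapshot
latest_snapshot_time() {
  aws rds describe-db-snapshots \
    --db-instance-identifier "$1" \
    --region "$2" \
    --query "sort_by(DBSnapshots[?Status=='available'], &SnapshotCreateTime)[-1].SnapshotCreateTime" \
    --output text
}

# check_snapshot_rpo INSTANCE_ID REGION RPO_MIN -> reports age and fails if it exceeds the RPO
check_snapshot_rpo() {
  local created_epoch age
  # GNU date is assumed here to parse the ISO 8601 timestamp returned by the CLI
  created_epoch=$(date -u -d "$(latest_snapshot_time "$1" "$2")" +%s)
  age=$(minutes_between "$created_epoch" "$(date -u +%s)")
  echo "Newest snapshot is ${age} minutes old (RPO: $3 minutes)"
  within_rpo "$age" "$3"
}
```

A call such as `check_snapshot_rpo production-db us-west-2 60` could run from cron or a CI job and alert on a non-zero exit status.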

Step 3: Automate Failover

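# Kubernetes: multi-region failover with external-dns
# Creates Route 53 failover records tied to a health check.
# The IPs and health check ID below are placeholders; Route 53 health
# checks need a publicly reachable endpoint, so substitute your own.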
apiVersion: externaldns.k8s.io/v1alpha1
kind: DNSEndpoint
metadata:
  name: api-failover
spec:
  endpoints:
    - dnsName: api.yourcompany.com
      recordType: A
      targets:
        - 10.1.0.50   # Primary (us-east-1)
      setIdentifier: primary
      providerSpecific:
        - name: aws/failover
          value: PRIMARY
        - name: aws/health-check-id
          value: "abc-123-health-check"
    - dnsName: api.yourcompany.com
      recordType: A
      targets:
        - 10.2.0.50   # Secondary (us-west-2)
      setIdentifier: secondary
      providerSpecific:
        - name: aws/failover
          value: SECONDARY
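
Health-check-driven failover normally waits for several consecutive failed probes before switching, so that a single dropped request does not trigger a failover. Route 53 exposes this as a failure threshold on the health check. The sketch below shows that decision logic in isolation; the function names and the `ok`/`fail` result strings are hypothetical, and `probe` is only a stand-in for a real health checker.

```shell
#!/usr/bin/env bash
# Sketch: decide on failover only after N consecutive failed probes.

# consecutive_failures RESULT... -> number of trailing "fail" results
consecutive_failures() {
  local count=0 result
  for result in "$@"; do
    if [[ $result == fail ]]; then
      count=$(( count + 1 ))
    else
      count=0   # any success resets the streak
    fi
  done
  echo "$count"
}

# should_failover THRESHOLD RESULT... -> exit status 0 if the failure streak reaches the threshold
should_failover() {
  local threshold=$1
  shift
  (( $(consecutive_failures "$@") >= threshold ))
}

# probe URL -> prints "ok" or "fail" for a single HTTP health check
probe() {
  if curl -fsS --max-time 5 -o /dev/null "$1"; then
    echo ok
  else
    echo fail
  fi
}
```

With a threshold of 3, `should_failover 3 ok fail fail fail` succeeds, while `should_failover 3 fail fail ok fail` does not, because the success in the middle resets the streak.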

Step 4: Test Your DR Plan

Tabletop Exercise (Quarterly)

## DR Tabletop Exercise Template

**Scenario:** Primary database server is corrupted at 2 AM Saturday.

**Walk through these questions:**

1. Who receives the first alert? (pager, Slack, email?)
2. What's the escalation path? (on-call → lead → VP)
3. How do we confirm it's a real disaster vs a monitoring false positive?
4. What's the failover command? Who has access to run it?
5. How long does failover take? (measured, not estimated)
6. How do we verify the DR environment is serving traffic correctly?
7. What data was lost between last backup and failure?
8. How do we communicate to customers? Template ready?
9. How do we failback to primary after recovery?
10. What's our post-mortem process?

Technical DR Test (Semi-Annual)

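#!/bin/bash
# Automated DR test script
set -euo pipefail

REGION="us-west-2"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
INSTANCE_ID="dr-test-${TIMESTAMP}"

# 5. Cleanup: registered up front so the test instance is deleted even if a step fails
cleanup() {
  aws rds delete-db-instance \
    --db-instance-identifier "${INSTANCE_ID}" \
    --skip-final-snapshot \
    --region "${REGION}" || true
}
trap cleanup EXIT

echo "=== DR Test Started ==="

# 1. Restore database backup in DR region
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier "${INSTANCE_ID}" \
  --db-snapshot-identifier "latest-cross-region-snap" \
  --region "${REGION}"

# 2. Wait for restoration
aws rds wait db-instance-available \
  --db-instance-identifier "${INSTANCE_ID}" \
  --region "${REGION}"

# 3. Validate data integrity (credentials come from PGUSER, PGPASSWORD, PGDATABASE)
DR_ENDPOINT=$(aws rds describe-db-instances \
  --db-instance-identifier "${INSTANCE_ID}" \
  --region "${REGION}" \
  --query 'DBInstances[0].Endpoint.Address' \
  --output text)
psql -h "${DR_ENDPOINT}" -c "SELECT COUNT(*) FROM orders WHERE order_date > NOW() - INTERVAL '1 day';"

# 4. Run smoke tests against DR environment; a failed check fails the whole test
curl -f https://dr-api.yourcompany.com/health || { echo "FAIL: API health check" >&2; exit 1; }

echo "=== DR Test Complete ==="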
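
A DR test is most useful when it records how long recovery actually took and compares that with the RTO target. The sketch below wraps any recovery command with a timer using bash's built-in `SECONDS` counter. The helper names (`rto_met`, `format_duration`, `run_timed`) are hypothetical, and the target is given in whole minutes.

```shell
#!/usr/bin/env bash
# Sketch: time a recovery step and compare the result with an RTO target.

# rto_met ELAPSED_SECONDS TARGET_MINUTES -> exit status 0 if recovery finished within the target
rto_met() {
  (( $1 <= $2 * 60 ))
}

# format_duration SECONDS -> duration as minutes and seconds, e.g. 12m34s
format_duration() {
  printf '%dm%02ds\n' $(( $1 / 60 )) $(( $1 % 60 ))
}

# run_timed TARGET_MINUTES COMMAND... -> runs the command, prints the elapsed time,
# and succeeds only if the command succeeded and finished within the target
run_timed() {
  local target_min=$1 start elapsed status=0
  shift
  start=$SECONDS
  "$@" || status=$?
  elapsed=$(( SECONDS - start ))
  echo "Recovery took $(format_duration "$elapsed") (target: ${target_min}m)"
  (( status == 0 )) && rto_met "$elapsed" "$target_min"
}
```

For instance, `run_timed 60 ./dr-test.sh` (where `dr-test.sh` is a placeholder for your own test script) would report the measured duration against a one-hour target and return a non-zero status if either the script failed or the target was missed.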

DR Cost Estimation

| Strategy | Monthly Cost (relative to production) | RTO | RPO |
|---|---|---|---|
| Backup & Restore | 5-10% | 4-24 hours | Hours |
| Pilot Light | 10-20% | 1-4 hours | Minutes |
| Warm Standby | 30-50% | 15-60 min | Minutes |
| Active-Active | 100%+ | Near-zero | Zero |
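
To turn the relative figures into a budget estimate, multiply the monthly production cost by the percentage range for the chosen strategy. The sketch below does that with integer arithmetic. The function name `dr_cost_range` and the strategy keys are hypothetical, and the percentages are simply the ranges from the table, so treat the output as a rough estimate.

```shell
#!/usr/bin/env bash
# Sketch: estimate the monthly DR cost range from the production cost.

# dr_cost_range STRATEGY PROD_MONTHLY_COST -> "LOW-HIGH" (or "LOW+") in whole currency units
dr_cost_range() {
  local strategy=$1 prod=$2 low high
  case "$strategy" in
    backup-restore) low=5;   high=10 ;;
    pilot-light)    low=10;  high=20 ;;
    warm-standby)   low=30;  high=50 ;;
    active-active)  low=100; high="" ;;   # table gives "100%+", so there is no upper bound
    *) echo "unknown strategy: $strategy" >&2; return 1 ;;
  esac
  if [[ -n $high ]]; then
    echo "$(( prod * low / 100 ))-$(( prod * high / 100 ))"
  else
    echo "$(( prod * low / 100 ))+"
  fi
}
```

With a production bill of 20000 per month, `dr_cost_range warm-standby 20000` prints `6000-10000` and `dr_cost_range active-active 20000` prints `20000+`.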

DR Checklist

  • RTO/RPO defined for every Tier 1 & 2 system
  • Automated backups configured with cross-region replication
  • Database backup restoration tested monthly
  • DNS failover configured and tested
  • DR runbook documented with step-by-step commands
  • On-call rotation includes DR responsibilities
  • Tabletop exercise completed quarterly
  • Full DR test completed semi-annually
  • Customer communication templates prepared
  • Post-mortem process defined for DR activations

:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For DR assessment consulting, visit garnetgrid.com.
:::