Explain the importance of resilience and recovery in security architecture
Systems will fail. Breaches will happen. Resilience is about continuing to operate during disruption; recovery is about getting back to normal after. Both must be designed into the architecture, not bolted on after an incident.
High Availability
Load Balancing
Distributing workloads across multiple servers so no single server is a bottleneck or single point of failure.
- Active-active: All servers handle traffic simultaneously. Best performance and redundancy.
- Active-passive: Standby servers take over only when the primary fails. Simpler but wastes idle capacity.
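- Layer 4 (transport) vs. Layer 7 (application): Layer 4 balancers route on IP address and port; Layer 7 balancers inspect application content such as URL paths, headers, or cookies.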
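The difference between the two modes can be sketched in a few lines. This is a toy model, not a real load balancer: the server names and the `health` dictionary are hypothetical stand-ins for whatever health checks a real product performs.

```python
from itertools import cycle

def healthy(servers, health):
    """Return the servers currently passing their health check."""
    return [s for s in servers if health.get(s, False)]

def active_active(servers, health, requests):
    """Round-robin each request across every healthy server."""
    pool = healthy(servers, health)
    if not pool:
        raise RuntimeError("no healthy servers")
    rotation = cycle(pool)
    return [next(rotation) for _ in range(requests)]

def active_passive(servers, health, requests):
    """Send all traffic to the first healthy server in priority order."""
    pool = healthy(servers, health)
    if not pool:
        raise RuntimeError("no healthy servers")
    return [pool[0]] * requests

servers = ["app-1", "app-2", "app-3"]   # hypothetical server names
health = {"app-1": True, "app-2": True, "app-3": True}
```

With all three servers healthy, `active_active(servers, health, 4)` spreads requests across `app-1`, `app-2`, `app-3`, then `app-1` again, while `active_passive` sends everything to `app-1` until its health check fails and `app-2` takes over.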
Clustering
Multiple servers operating as a single logical system. If one node fails, others continue serving.
- Database clusters, application server clusters
- Requires shared or replicated state between nodes
Redundancy
Eliminating single points of failure across every layer:
- Server: Multiple servers behind load balancers
- Network: Dual ISPs, redundant switches/routers, diverse routing paths
- Power: Dual power supplies, UPS (Uninterruptible Power Supply), generators
- Storage: RAID, replicated databases, distributed storage
- Site: Geographic redundancy (hot/warm/cold sites)
RAID (Redundant Array of Independent Disks)
| Level | Description | Fault Tolerance | Min Disks |
|---|---|---|---|
| RAID 0 | Striping. Performance, no redundancy. | None (one disk fails, all data lost) | 2 |
| RAID 1 | Mirroring. Exact copy on two disks. | Survives one disk failure | 2 |
| RAID 5 | Striping with distributed parity. | Survives one disk failure | 3 |
| RAID 6 | Striping with double parity. | Survives two disk failures | 4 |
| RAID 10 | Mirror + stripe (1+0). | Survives one failure per mirror pair | 4 |
Exam tip: RAID is not a backup. It protects against disk failure, not against data corruption, ransomware, or accidental deletion. You still need backups.
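To see why a parity level such as RAID 5 survives one disk failure, it helps to look at the XOR arithmetic behind it. The sketch below models a single stripe on a hypothetical 3-disk array; the block contents are made up, and a real array also rotates the parity block across disks from stripe to stripe.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe: two data blocks plus one parity block (illustrative data).
d0 = b"ABCD"
d1 = b"WXYZ"
parity = xor_blocks([d0, d1])

# The disk holding d1 fails: rebuild it from the surviving block and parity.
rebuilt = xor_blocks([d0, parity])
```

Here `rebuilt` equals `d1`. If two of the three blocks are lost there is nothing left to XOR against, which is why single parity tolerates exactly one failure.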
Backup Strategies
Types
- Full: Complete copy of all data. Slowest to create, fastest to restore.
- Incremental: Only data changed since the last backup (any type). Fast to create, slower to restore (need full + all incrementals).
- Differential: All data changed since the last full backup. Middle ground — larger than incremental, faster to restore.
- Snapshot: Point-in-time copy of a volume or VM (virtual machine) state. Fast creation through copy-on-write. Used for quick recovery but not a replacement for offsite backups.
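The restore-speed trade-off comes down to how many backup sets must be applied. A minimal sketch, assuming a chronological list of hypothetical `(label, kind)` tuples that contains at least one full backup:

```python
def restore_chain(backups):
    """Return the labels of the backups needed to restore to the latest one.

    `backups` is a chronological list of (label, kind) tuples where kind is
    "full", "incremental", or "differential".
    """
    # Everything starts from the most recent full backup.
    last_full = max(i for i, (_, kind) in enumerate(backups) if kind == "full")
    chain = [backups[last_full]]
    for backup in backups[last_full + 1:]:
        kind = backup[1]
        if kind == "differential":
            # A differential holds every change since the full, so anything
            # taken between the full and this point is no longer needed.
            chain = [chain[0], backup]
        elif kind == "incremental":
            chain.append(backup)
    return [label for label, _ in chain]

week_incremental = [("sun", "full"), ("mon", "incremental"),
                    ("tue", "incremental"), ("wed", "incremental")]
week_differential = [("sun", "full"), ("mon", "differential"),
                     ("tue", "differential"), ("wed", "differential")]
```

Restoring Wednesday from the incremental schedule needs all four sets (`sun`, `mon`, `tue`, `wed`); the differential schedule needs only `sun` and `wed`.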
3-2-1 Rule
- 3 copies of data (primary + 2 backups)
- 2 different media types (disk + tape, disk + cloud)
- 1 offsite copy (geographically separate)
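The rule is simple enough to express as a check. The `Copy` record and its field names below are hypothetical; the primary copy is counted as one of the three.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Copy:
    name: str
    media: str      # e.g. "disk", "tape", "cloud"
    offsite: bool

def meets_3_2_1(copies):
    """Check a list of data copies (primary included) against the 3-2-1 rule."""
    return (
        len(copies) >= 3                              # 3 copies
        and len({c.media for c in copies}) >= 2       # 2 media types
        and any(c.offsite for c in copies)            # 1 offsite
    )

compliant = [Copy("primary", "disk", False),
             Copy("nas", "disk", False),
             Copy("vault", "cloud", True)]
same_room = [Copy("primary", "disk", False),
             Copy("nas", "disk", False),
             Copy("usb", "disk", False)]
```

`meets_3_2_1(compliant)` is true; `same_room` fails on both media diversity and the offsite requirement even though it has three copies.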
Backup Encryption
A notable gap that CompTIA tests:
- Backups contain the same sensitive data as production — they must be encrypted
- At rest: AES-256 encryption on backup storage (tape, disk, cloud)
- In transit: TLS for network-based backup transfers
- Key management: Backup encryption keys must be stored separately from the backups themselves. If the keys are on the backup server and it’s compromised, encryption is worthless.
- Ransomware consideration: Attackers target backup systems specifically. Encrypted, immutable (WORM — Write Once Read Many), offsite backups are the last line of defense.
- Key escrow for backups: If the person who manages encryption keys leaves, you need a way to decrypt backups. Key escrow or split-key arrangements prevent crypto-lockout.
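A split-key arrangement can be illustrated with the simplest possible scheme: a two-share XOR split in which both custodians must cooperate to rebuild the key. This is a sketch only; real deployments more often use a threshold scheme or a hardware or cloud key management service, and the function names here are hypothetical.

```python
import secrets

def split_key(key):
    """Split a key into two shares; both are required to rebuild it."""
    share_a = secrets.token_bytes(len(key))               # random pad
    share_b = bytes(k ^ a for k, a in zip(key, share_a))  # key XOR pad
    return share_a, share_b

def combine_shares(share_a, share_b):
    """XOR the two shares back together to recover the original key."""
    return bytes(a ^ b for a, b in zip(share_a, share_b))

backup_key = secrets.token_bytes(32)   # 256-bit key, e.g. for AES-256
share_a, share_b = split_key(backup_key)
```

Each share on its own is indistinguishable from random bytes, so neither custodian can decrypt backups alone, and the departure of one person does not lock the organization out as long as their share is held in escrow.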
Testing
Backups that aren’t tested are assumptions, not backups. Regular restore tests verify:
- Data integrity (restored data matches original)
- Recovery time (meets the Recovery Time Objective, RTO)
- Process documentation accuracy
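The first two checks lend themselves to automation. A minimal sketch, assuming a hypothetical `restore_fn` callable that performs the restore and returns the restored bytes:

```python
import hashlib
import time

def sha256_of(data):
    """Hex digest used to compare original and restored data."""
    return hashlib.sha256(data).hexdigest()

def verify_restore(original, restore_fn, rto_seconds):
    """Run a restore, then check integrity and elapsed time against the RTO."""
    started = time.monotonic()
    restored = restore_fn()
    elapsed = time.monotonic() - started
    return {
        "integrity_ok": sha256_of(restored) == sha256_of(original),
        "within_rto": elapsed <= rto_seconds,
        "elapsed_seconds": elapsed,
    }

sample = b"payroll records"   # illustrative data
result = verify_restore(sample, lambda: sample, rto_seconds=60)
```

In practice the reference hash would be recorded at backup time rather than recomputed from live data, so a restore test can still detect corruption after the original has changed.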
Replication
Synchronous vs. Asynchronous
| Type | How It Works | RPO | Latency Impact | Use Case |
|---|---|---|---|---|
| Synchronous | Write is confirmed only after both primary AND replica acknowledge | Zero — no data loss | Higher (write waits for remote confirmation) | Financial transactions, databases where any data loss is unacceptable |
| Asynchronous | Write is confirmed after primary acknowledges; replica updates afterward | Non-zero — lag between primary and replica | Lower (write doesn’t wait) | Cross-region replication, backup sites, most general workloads |
Exam tip: If the question describes zero RPO (Recovery Point Objective), the answer involves synchronous replication. If the question mentions geographic distance (cross-region, cross-country), asynchronous is more practical because synchronous replication over long distances introduces unacceptable latency.
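The RPO difference in the table can be made concrete with a toy model. The class below is hypothetical and ignores networking entirely; it only tracks which acknowledged writes exist on the replica at the moment the primary fails.

```python
class ReplicatedStore:
    """Toy primary/replica pair; the mode decides when a write is acknowledged."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = []
        self.replica = []
        self.pending = []   # acknowledged writes not yet on the replica

    def write(self, record):
        self.primary.append(record)
        if self.synchronous:
            self.replica.append(record)   # acknowledged only once replica has it
        else:
            self.pending.append(record)   # acknowledged now, shipped later

    def ship(self):
        """Asynchronous catch-up: apply queued writes to the replica."""
        self.replica.extend(self.pending)
        self.pending.clear()

    def lost_if_primary_fails(self):
        """Acknowledged writes that would be lost if the primary died now."""
        return list(self.pending)

sync_store = ReplicatedStore(synchronous=True)
sync_store.write("t1")
sync_store.write("t2")

async_store = ReplicatedStore(synchronous=False)
async_store.write("t1")
async_store.ship()
async_store.write("t2")   # not yet shipped when the primary fails
```

The synchronous store never has anything to lose, which is the zero-RPO property; the asynchronous store would lose `t2`, the write that landed during the replication lag.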
Database Recovery
Point-in-Time Recovery (PITR)
Restore a database to any specific moment in time, not just the last backup:
- Combines a full backup + transaction logs (replay transactions up to the desired timestamp)
- Essential for recovering from data corruption or accidental deletion (“undo the DROP TABLE that happened at 2:43 PM”)
- Most enterprise databases support PITR: PostgreSQL, MySQL, Oracle, SQL Server
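Conceptually, PITR is a replay loop that stops at the chosen timestamp. The sketch below uses an in-memory dictionary as the "database" and a hypothetical log format of `(timestamp, op, key, value)` tuples in chronological order; the dates and values are invented for illustration.

```python
from datetime import datetime

def point_in_time_restore(full_backup, log, target):
    """Rebuild state from a full backup plus log entries up to `target`."""
    state = dict(full_backup)            # start from the backup image
    for timestamp, op, key, value in log:
        if timestamp > target:
            break                        # stop before later changes
        if op == "set":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state

backup = {"alice": 100}
log = [
    (datetime(2024, 1, 1, 14, 30), "set", "bob", 50),
    (datetime(2024, 1, 1, 14, 43), "delete", "alice", None),  # the mistake
]
restored = point_in_time_restore(backup, log, datetime(2024, 1, 1, 14, 42))
```

Restoring to 2:42 PM keeps the legitimate 2:30 PM change and skips the 2:43 PM deletion, so `restored` contains both `alice` and `bob`.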
Write-Ahead Logging (WAL)
- All database changes are written to a log before being applied to the data files
- If the database crashes, it replays the WAL to recover committed transactions that weren’t yet written to disk
- WAL files can be archived and shipped to a replica for continuous replication (log shipping)
- PostgreSQL example: WAL archiving + PITR enables continuous backup with near-zero RPO
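The log-first ordering is the whole idea, and a toy store shows it. This is not how any particular database implements WAL; the class, its fields, and the `crash_before_apply` flag are hypothetical devices for simulating a crash between the two steps.

```python
class TinyWalStore:
    """Toy key-value store that logs each change before applying it."""

    def __init__(self):
        self.wal = []     # stands in for the durable log on disk
        self.data = {}    # stands in for the data files

    def commit(self, key, value, crash_before_apply=False):
        self.wal.append((key, value))   # 1. write the log record first
        if crash_before_apply:
            return                      # simulated crash: data files are stale
        self.data[key] = value          # 2. then apply to the data files

    def recover(self):
        """Replay the log so every committed change reaches the data files."""
        for key, value in self.wal:
            self.data[key] = value

store = TinyWalStore()
store.commit("a", 1)
store.commit("b", 2, crash_before_apply=True)
stale = dict(store.data)   # state right after the simulated crash
store.recover()
```

After the simulated crash the data files hold only `a`; replaying the log brings back `b`, because the commit was already recorded before the crash.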
Transaction Logs
- Sequential record of every database transaction (inserts, updates, deletes)
- Used for: crash recovery (replay committed transactions), replication (ship to replicas), auditing (who changed what when), PITR (replay to specific point)
- Retention: Transaction logs can grow very large. Retention policy must balance recovery capability against storage cost.
Journaling
File system journaling records metadata changes (and optionally data changes) before they’re committed to the file system:
- If the system crashes mid-write, the journal is replayed on recovery to bring the file system to a consistent state
- Prevents file system corruption — without journaling, a crash during a write operation could leave the file system in an inconsistent state requiring a full check (fsck)
- Modern file systems use journaling by default: ext4 (Linux), NTFS (Windows), HFS+ (older macOS). APFS (current macOS) achieves similar crash consistency through copy-on-write metadata rather than a journal.
- Metadata-only journaling: Faster, protects file system structure but not file contents
- Full journaling: Slower, protects both metadata and file data
Exam context: Journaling is a resilience mechanism for storage systems, not a backup mechanism. It prevents corruption from unexpected shutdowns, not from deliberate attacks or hardware failure.
Recovery Sites
Hot Site
Fully equipped, real-time data replication, ready to take over immediately.
- RTO: Minutes to hours
- Most expensive. Maintained continuously with live data sync.
Warm Site
Equipment installed but not fully configured. Data must be restored from backups.
- RTO: Hours to days
- Balance between cost and recovery speed
Cold Site
Empty facility with power, cooling, and network connectivity. No equipment pre-installed.
- RTO: Days to weeks
- Cheapest. Equipment must be procured and configured after activation.
Cloud-Based Recovery
- DRaaS (Disaster Recovery as a Service): Cloud provider hosts your recovery environment
- Scales on demand — pay for standby capacity, spin up full environment during disaster
- Eliminates physical site management
Selection Decision Logic
When to Choose Which Backup Type
| If the scenario says… | Choose… | Because… |
|---|---|---|
| “Fastest possible restore” | Full | Single backup to restore, no chaining |
| “Minimize backup window / fastest backup” | Incremental | Only changes since last backup (any type) |
| “Balance backup speed and restore speed” | Differential | Only changes since last full, faster restore than incremental |
| “Instant rollback of a VM” | Snapshot | Point-in-time copy, seconds to create |
| “RPO of zero” | Continuous (synchronous) replication | Not a traditional backup; writes are mirrored in real time |
When to Choose Which Recovery Site
| If the scenario says… | Choose… | Because… |
|---|---|---|
| “RTO of minutes” or “zero downtime” | Hot site | Live data, ready to go immediately |
| “RTO of hours” or “cost-conscious with reasonable recovery” | Warm site | Equipment ready, data needs restoration |
| “RTO of days/weeks is acceptable” or “minimal budget” | Cold site | Facility only, everything must be provisioned |
| “Elastic scaling during disaster” | Cloud DR (DRaaS) | Pay for what you use, scales on demand |
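The table reads naturally as a decision function. The sketch below is one way to encode it; the 1-hour and 24-hour cut-offs are assumptions chosen to turn "minutes", "hours", and "days" into numbers, not thresholds taken from any standard.

```python
def choose_recovery_site(rto_hours, needs_elastic_scaling=False):
    """Map a required RTO (in hours) to a recovery site type."""
    if needs_elastic_scaling:
        return "cloud DR (DRaaS)"
    if rto_hours < 1:        # assumed cut-off for "minutes"
        return "hot site"
    if rto_hours < 24:       # assumed cut-off for "hours"
        return "warm site"
    return "cold site"       # days or weeks are acceptable
```

For example, a 15-minute RTO (`0.25`) maps to a hot site, an 8-hour RTO to a warm site, and a 72-hour RTO to a cold site.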
Recovery Site Comparison
| Attribute | Hot | Warm | Cold | Cloud DR |
|---|---|---|---|---|
| RTO | Minutes | Hours | Days-weeks | Minutes-hours |
| Cost | Highest | Moderate | Lowest | Variable (usage-based) |
| Data currency | Real-time | Last backup | Last backup | Configurable |
| Maintenance | Continuous | Periodic | Minimal | Provider-managed |
| Best for | Mission-critical (finance, healthcare) | Important but not real-time critical | Budget-constrained, low-priority systems | Organizations already in cloud |
Recovery Metrics
RTO (Recovery Time Objective)
Maximum acceptable time to restore operations after a disruption.
- “We must be back online within 4 hours.”
- Drives decisions about recovery site type, backup strategy, and automation investment.
RPO (Recovery Point Objective)
Maximum acceptable amount of data loss measured in time.
- “We can afford to lose no more than 1 hour of data.”
- RPO of 1 hour → backups/replication must happen at least hourly.
- RPO of 0 → requires real-time synchronous replication.
MTBF (Mean Time Between Failures)
Average time a system operates before failing. Measure of reliability.
- Higher MTBF = more reliable system
- Used to predict failure frequency and plan maintenance
MTTR (Mean Time to Repair)
Average time to restore a system after failure. Measure of maintainability.
- Lower MTTR = faster recovery
- Improved by automation, documentation, spare parts availability
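One common way to combine the two metrics is steady-state availability, MTBF / (MTBF + MTTR). The sketch below applies that formula; the input figures in the usage note are illustrative, not measurements.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability as a fraction between 0 and 1."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def annual_downtime_hours(mtbf_hours, mttr_hours):
    """Expected downtime per 8,760-hour year at that availability."""
    return (1 - availability(mtbf_hours, mttr_hours)) * 8760
```

A system with an MTBF of 999 hours and an MTTR of 1 hour is available 99.9% of the time, which works out to roughly 8.76 hours of downtime per year. Halving MTTR improves availability as much as doubling MTBF would, which is why automation and spare parts matter.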
Power Resilience
UPS (Uninterruptible Power Supply)
Battery backup that provides immediate power during outages.
- Bridges the gap between power loss and generator startup (typically 10-30 seconds)
- Also provides power conditioning (surge protection, voltage regulation)
Generator
Diesel or natural gas backup power for extended outages.
- Requires fuel supply and regular testing
- Automatic transfer switch (ATS) manages failover from utility to generator
PDU (Power Distribution Unit)
Distributes power to rack-mounted equipment. Managed PDUs allow remote power cycling and monitoring.
Capacity Planning
Ensuring infrastructure can handle current and future demand:
- People: Sufficient staff for security operations, incident response, recovery
- Technology: Processing capacity, storage, network bandwidth
- Infrastructure: Data center space, power, cooling
Underprovisioning creates availability risk. Overprovisioning wastes resources. Capacity planning balances both against business requirements and growth projections.
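A simple growth projection shows how much lead time a given headroom buys. This sketch assumes steady compound monthly growth, which real workloads rarely follow exactly; the function name and parameters are hypothetical.

```python
import math

def months_until_full(used, capacity, monthly_growth_rate):
    """Whole months until compound growth brings usage up to capacity."""
    if used >= capacity:
        return 0
    if monthly_growth_rate <= 0:
        return None   # no growth: capacity is never exhausted
    return math.ceil(
        math.log(capacity / used) / math.log(1 + monthly_growth_rate)
    )
```

For example, 60 TB used out of 100 TB with 5% monthly growth gives `months_until_full(60, 100, 0.05) == 11`, so procurement with a six-month lead time would need to start within about five months.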
Offensive Context
Resilience is what prevents a successful attack from becoming a catastrophe. An attacker who detonates ransomware against an org with tested backups, a warm site, and a 4-hour RTO has caused an inconvenience. The same attack against an org with untested backups and no recovery site is an existential threat. Attackers increasingly target backup systems specifically (deleting shadow copies, encrypting backup servers, compromising cloud backup credentials) because they know destroying recovery capability maximizes leverage. Your backup architecture must assume the attacker will try to destroy it.