3.4: Explain the importance of resilience and recovery in security architecture // Wolf//Sec

Systems will fail. Breaches will happen. Resilience is about continuing to operate during disruption; recovery is about getting back to normal after. Both must be designed into the architecture, not bolted on after an incident.

High Availability

Load Balancing

Distributing workloads across multiple servers so no single server is a bottleneck or single point of failure.

Active-active: All servers handle traffic simultaneously. Best performance and redundancy.
Active-passive: Standby servers take over only when the primary fails. Simpler but wastes idle capacity.
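Layer 4 (transport) vs. Layer 7 (application) load balancing: Layer 4 routes on IP address and port without inspecting content; Layer 7 routes on application data such as URL path, HTTP headers, or cookies.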
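The difference between the two modes can be sketched in a few lines. This is an illustrative model only: the function names, the round-robin policy, and the priority-ordered server list are assumptions, not a description of any particular load balancer.

```python
from itertools import cycle


def active_active(servers, healthy, n_requests):
    """Spread requests round-robin across every healthy server."""
    pool = [s for s in servers if s in healthy]
    if not pool:
        raise RuntimeError("no healthy servers")
    rotation = cycle(pool)
    return [next(rotation) for _ in range(n_requests)]


def active_passive(servers, healthy, n_requests):
    """Send all requests to the first healthy server in priority order.

    servers[0] is treated as the primary; later entries are standbys.
    """
    for server in servers:
        if server in healthy:
            return [server] * n_requests
    raise RuntimeError("no healthy servers")


if __name__ == "__main__":
    servers = ["web1", "web2"]
    print(active_active(servers, {"web1", "web2"}, 4))
    # Primary web1 is down, so the standby takes all traffic.
    print(active_passive(servers, {"web2"}, 2))
```

In both modes the health check is what removes the failed node; the modes differ only in whether standby capacity carries traffic before the failure.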

Clustering

Multiple servers operating as a single logical system. If one node fails, others continue serving.

Database clusters, application server clusters
Requires shared or replicated state between nodes

Redundancy

Eliminating single points of failure across every layer:

Server: Multiple servers behind load balancers
Network: Dual ISPs, redundant switches/routers, diverse routing paths
Power: Dual power supplies, UPS (Uninterruptible Power Supply), generators
Storage: RAID, replicated databases, distributed storage
Site: Geographic redundancy (hot/warm/cold sites)

RAID (Redundant Array of Independent Disks)

Level	Description	Fault Tolerance	Min Disks
RAID 0	Striping. Performance, no redundancy.	None (one disk fails, all data lost)	2
RAID 1	Mirroring. Exact copy on two disks.	Survives one disk failure	2
RAID 5	Striping with distributed parity.	Survives one disk failure	3
RAID 6	Striping with double parity.	Survives two disk failures	4
RAID 10	Mirror + stripe (1+0).	Survives one failure per mirror pair	4

Exam tip: RAID is not a backup. It protects against disk failure, not against data corruption, ransomware, or accidental deletion. You still need backups.
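The table's trade-offs can be checked with a little arithmetic. The sketch below assumes equally sized disks and treats RAID 1 as a single mirror set; `usable_capacity` and `xor_blocks` are hypothetical helper names. The XOR demo shows the idea behind single-parity rebuilds (RAID 5), not a real controller's on-disk layout.

```python
from functools import reduce

MIN_DISKS = {0: 2, 1: 2, 5: 3, 6: 4, 10: 4}


def usable_capacity(level, disks, size_tb):
    """Usable capacity in TB for an array of equally sized disks."""
    if disks < MIN_DISKS[level]:
        raise ValueError(f"RAID {level} needs at least {MIN_DISKS[level]} disks")
    if level == 0:
        return disks * size_tb            # striping only, nothing reserved
    if level == 1:
        return size_tb                    # every disk holds the same data
    if level == 5:
        return (disks - 1) * size_tb      # one disk's worth of parity
    if level == 6:
        return (disks - 2) * size_tb      # two disks' worth of parity
    if disks % 2:
        raise ValueError("RAID 10 needs an even number of disks")
    return (disks // 2) * size_tb         # half the disks are mirrors


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


if __name__ == "__main__":
    data = [b"AAAA", b"BBBB", b"CCCC"]
    parity = xor_blocks(data)
    # Lose the second block, then rebuild it from the survivors plus parity.
    rebuilt = xor_blocks([data[0], data[2], parity])
    print(rebuilt == data[1])
    print(usable_capacity(5, 4, 2), "TB usable from four 2 TB disks in RAID 5")
```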

Backup Strategies

Types

Full: Complete copy of all data. Slowest to create, fastest to restore.
Incremental: Only data changed since the last backup (any type). Fast to create, slower to restore (need full + all incrementals).
Differential: All data changed since the last full backup. Middle ground — larger than incremental, faster to restore.
Snapshot: Point-in-time copy of a volume or VM state. Fast creation through copy-on-write. Used for quick recovery but not a replacement for offsite backups.
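The restore-speed difference comes down to how many backup sets must be applied. A minimal sketch, assuming backups are given oldest first as (label, kind) pairs; `restore_chain` is a hypothetical helper, not a backup product's API.

```python
def restore_chain(backups):
    """Return the labels needed to restore to the most recent backup.

    backups: list of (label, kind) tuples, oldest first, where kind is
    "full", "incremental", or "differential". Raises ValueError if the
    list contains no full backup.
    """
    last_full = max(i for i, (_, kind) in enumerate(backups) if kind == "full")
    chain = [backups[last_full][0]]
    tail = backups[last_full + 1:]

    # The newest differential already holds every change since the full,
    # so anything taken between the full and that differential is skipped.
    diff_positions = [i for i, (_, kind) in enumerate(tail) if kind == "differential"]
    if diff_positions:
        newest = diff_positions[-1]
        chain.append(tail[newest][0])
        tail = tail[newest + 1:]

    # Every remaining incremental is needed, in order.
    chain.extend(label for label, kind in tail if kind == "incremental")
    return chain


if __name__ == "__main__":
    week = [("sun", "full"), ("mon", "incremental"), ("tue", "incremental")]
    print(restore_chain(week))   # full plus every incremental
    week = [("sun", "full"), ("mon", "differential"), ("tue", "differential")]
    print(restore_chain(week))   # full plus only the latest differential
```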

3-2-1 Rule

3 copies of data (primary + 2 backups)
2 different media types (disk + tape, disk + cloud)
1 offsite copy (geographically separate)
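The rule is easy to audit mechanically. The sketch below is one way to express it; the `Copy` record and `check_321` function are hypothetical names, and the inventory is made-up example data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    media: str      # e.g. "disk", "tape", "cloud"
    offsite: bool


def check_321(copies):
    """Return a list of 3-2-1 violations; an empty list means compliant."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than 3 copies")
    if len({c.media for c in copies}) < 2:
        problems.append("fewer than 2 media types")
    if not any(c.offsite for c in copies):
        problems.append("no offsite copy")
    return problems


if __name__ == "__main__":
    inventory = [
        Copy("production", "disk", offsite=False),
        Copy("local backup", "disk", offsite=False),
        Copy("cloud backup", "cloud", offsite=True),
    ]
    print(check_321(inventory) or "3-2-1 satisfied")
```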

Backup Encryption

Backup encryption is a commonly overlooked gap that CompTIA tests:

Backups contain the same sensitive data as production — they must be encrypted
At rest: AES-256 encryption on backup storage (tape, disk, cloud)
In transit: TLS for network-based backup transfers
Key management: Backup encryption keys must be stored separately from the backups themselves. If the keys are on the backup server and it’s compromised, encryption is worthless.
Ransomware consideration: Attackers target backup systems specifically. Encrypted, immutable (WORM — Write Once Read Many), offsite backups are the last line of defense.
Key escrow for backups: If the person who manages encryption keys leaves, you need a way to decrypt backups. Key escrow or split-key arrangements prevent crypto-lockout.
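A split-key arrangement can be illustrated with a simple two-share XOR split: neither custodian's share reveals anything about the key, but both together reconstruct it. This is a sketch of the concept only. The function names are assumptions, and production escrow would more likely rely on an HSM, a key management service, or a threshold scheme.

```python
import secrets


def split_key(key):
    """Split a key into two shares; both are required to recover it."""
    share_a = secrets.token_bytes(len(key))              # uniformly random
    share_b = bytes(k ^ a for k, a in zip(key, share_a))
    return share_a, share_b


def combine_shares(share_a, share_b):
    """Recombine two shares produced by split_key."""
    if len(share_a) != len(share_b):
        raise ValueError("shares must be the same length")
    return bytes(a ^ b for a, b in zip(share_a, share_b))


if __name__ == "__main__":
    backup_key = secrets.token_bytes(32)   # 256-bit key, e.g. for AES-256
    custodian_1, custodian_2 = split_key(backup_key)
    print(combine_shares(custodian_1, custodian_2) == backup_key)
```

Storing each share with a different custodian, and neither on the backup server, covers both the separate-storage and the crypto-lockout points above.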

Testing

Backups that aren’t tested are assumptions, not backups. Regular restore tests verify:

Data integrity (restored data matches original)
Recovery time (meets RTO)
Process documentation accuracy
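The data-integrity check can be automated by hashing every file in the source and comparing it with the restored copy. A minimal sketch using only the standard library; `verify_restore` is a hypothetical helper and the directory layout is an assumption.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """SHA-256 of a file, read in 1 MiB chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(original_dir, restored_dir):
    """Return relative paths that are missing or differ after a restore."""
    original_dir, restored_dir = Path(original_dir), Path(restored_dir)
    mismatches = []
    for source in sorted(p for p in original_dir.rglob("*") if p.is_file()):
        relative = source.relative_to(original_dir)
        restored = restored_dir / relative
        if not restored.is_file() or file_digest(source) != file_digest(restored):
            mismatches.append(relative.as_posix())
    return mismatches
```

An empty result means every original file was restored bit-for-bit. Timing the restore alongside this check gives the figure to compare against the RTO.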

Replication

Synchronous vs. Asynchronous

Type	How It Works	RPO	Latency Impact	Use Case
Synchronous	Write is confirmed only after both primary AND replica acknowledge	Zero — no data loss	Higher (write waits for remote confirmation)	Financial transactions, databases where any data loss is unacceptable
Asynchronous	Write is confirmed after primary acknowledges; replica updates afterward	Non-zero — lag between primary and replica	Lower (write doesn’t wait)	Cross-region replication, backup sites, most general workloads

Exam tip: If the question describes zero RPO, the answer involves synchronous replication. If the question mentions geographic distance (cross-region, cross-country), asynchronous is more practical because synchronous replication over long distances introduces unacceptable latency.
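Why asynchronous replication has a non-zero RPO can be shown with a toy model. `ReplicatedStore` is a hypothetical class: lists stand in for storage, and `flush` stands in for the replica catching up.

```python
class ReplicatedStore:
    """Toy primary/replica pair; lists stand in for durable storage."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = []
        self.replica = []
        self.pending = []   # acknowledged to the client, not yet on the replica

    def write(self, record):
        self.primary.append(record)
        if self.synchronous:
            self.replica.append(record)   # wait for the replica before acking
        else:
            self.pending.append(record)   # ack now, replicate later
        return "ack"

    def flush(self):
        """Asynchronous catch-up: ship pending records to the replica."""
        self.replica.extend(self.pending)
        self.pending.clear()

    def fail_primary(self):
        """Primary is lost; return the acknowledged writes lost with it."""
        lost = list(self.pending)
        self.pending.clear()
        self.primary.clear()
        return lost


if __name__ == "__main__":
    sync = ReplicatedStore(synchronous=True)
    sync.write("txn-1")
    print(sync.fail_primary())    # nothing acknowledged is lost

    lazy = ReplicatedStore(synchronous=False)
    lazy.write("txn-1")
    lazy.flush()
    lazy.write("txn-2")           # acknowledged, but still in flight
    print(lazy.fail_primary())    # the in-flight write is lost
```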

Database Recovery

Point-in-Time Recovery (PITR)

Restore a database to any specific moment in time, not just the last backup:

Combines a full backup + transaction logs (replay transactions up to the desired timestamp)
Essential for recovering from data corruption or accidental deletion (“undo the DROP TABLE that happened at 2:43 PM”)
Most enterprise databases support PITR: PostgreSQL, MySQL, Oracle, SQL Server
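The replay step can be sketched with a dictionary standing in for the database. The log format of (timestamp, operation, key, value) and the function name are assumptions made for illustration; real databases replay their own binary log formats.

```python
from datetime import datetime


def point_in_time_restore(base_backup, log, target):
    """Rebuild state from a full backup plus log entries up to `target`.

    base_backup: dict captured by the full backup.
    log: iterable of (timestamp, op, key, value), op is "set" or "delete".
    target: restore to the state as of this datetime (inclusive).
    """
    state = dict(base_backup)
    for timestamp, op, key, value in sorted(log, key=lambda entry: entry[0]):
        if timestamp > target:
            break                       # stop before the unwanted change
        if op == "set":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state


if __name__ == "__main__":
    backup = {"orders": 100}
    log = [
        (datetime(2024, 1, 1, 14, 40), "set", "orders", 101),
        (datetime(2024, 1, 1, 14, 43), "delete", "orders", None),  # the mistake
    ]
    # Restore to one minute before the accidental delete.
    print(point_in_time_restore(backup, log, datetime(2024, 1, 1, 14, 42)))
```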

Write-Ahead Logging (WAL)

All database changes are written to a log before being applied to the data files
If the database crashes, it replays the WAL to recover committed transactions that weren’t yet written to disk
WAL files can be archived and shipped to a replica for continuous replication (log shipping)
PostgreSQL example: WAL archiving + PITR enables continuous backup with near-zero RPO
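The log-first ordering and crash replay can be modeled in a few lines. This is a teaching sketch, not how PostgreSQL implements WAL: `TinyWalStore` is a hypothetical class, and plain Python objects stand in for the durable log, the data files, and the volatile page cache.

```python
class TinyWalStore:
    """Toy key-value store illustrating write-ahead logging."""

    def __init__(self):
        self.wal = []         # durable log (survives a crash)
        self.data_file = {}   # durable data files (may lag behind the log)
        self.flushed = 0      # number of WAL records already in data_file
        self.memory = {}      # volatile cache (lost on crash)

    def put(self, key, value):
        self.wal.append((key, value))   # 1. record the change in the log
        self.memory[key] = value        # 2. only then change the cached page

    def checkpoint(self):
        """Apply log records not yet written to the data files."""
        for key, value in self.wal[self.flushed:]:
            self.data_file[key] = value
        self.flushed = len(self.wal)

    def crash(self):
        self.memory = {}                # everything volatile disappears

    def recover(self):
        self.checkpoint()               # replay what the data files missed
        self.memory = dict(self.data_file)


if __name__ == "__main__":
    store = TinyWalStore()
    store.put("a", 1)
    store.checkpoint()
    store.put("b", 2)          # logged, but not yet in the data files
    store.crash()
    store.recover()
    print(store.memory)        # both writes are back after replay
```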

Transaction Logs

Sequential record of every database transaction (inserts, updates, deletes)
Used for: crash recovery (replay committed transactions), replication (ship to replicas), auditing (who changed what when), PITR (replay to specific point)
Retention: Transaction logs can grow very large. Retention policy must balance recovery capability against storage cost.

Journaling

File system journaling records metadata changes (and optionally data changes) before they’re committed to the file system:

If the system crashes mid-write, the journal is replayed on recovery to bring the file system to a consistent state
Prevents file system corruption — without journaling, a crash during a write operation could leave the file system in an inconsistent state requiring a full check (fsck)
Many modern file systems use journaling by default: ext4 (Linux), NTFS (Windows), HFS+ (older macOS). APFS (current macOS) gets similar crash protection from copy-on-write metadata rather than a journal.
Metadata-only journaling: Faster, protects file system structure but not file contents
Full journaling: Slower, protects both metadata and file data

Exam context: Journaling is a resilience mechanism for storage systems, not a backup mechanism. It prevents corruption from unexpected shutdowns, not from deliberate attacks or hardware failure.

Recovery Sites

Hot Site

Fully equipped, real-time data replication, ready to take over immediately.

RTO: Minutes to hours
Most expensive. Maintained continuously with live data sync.

Warm Site

Equipment installed but not fully configured. Data must be restored from backups.

RTO: Hours to days
Balance between cost and recovery speed

Cold Site

Empty facility with power, cooling, and network connectivity. No equipment pre-installed.

RTO: Days to weeks
Cheapest. Equipment must be procured and configured after activation.

Cloud-Based Recovery

DRaaS (Disaster Recovery as a Service): Cloud provider hosts your recovery environment
Scales on demand — pay for standby capacity, spin up full environment during disaster
Eliminates physical site management

Selection Decision Logic

When to Choose Which Backup Type

If the scenario says…	Choose…	Because…
“Fastest possible restore”	Full	Single backup to restore, no chaining
“Minimize backup window / fastest backup”	Incremental	Only changes since last backup (any type)
“Balance backup speed and restore speed”	Differential	Only changes since last full, faster restore than incremental
“Instant rollback of a VM”	Snapshot	Point-in-time copy, seconds to create
“RPO of zero”	Synchronous replication	Not a traditional backup; every write is confirmed on the replica in real time

When to Choose Which Recovery Site

If the scenario says…	Choose…	Because…
“RTO of minutes” or “zero downtime”	Hot site	Live data, ready to go immediately
“RTO of hours” or “cost-conscious with reasonable recovery”	Warm site	Equipment ready, data needs restoration
“RTO of days/weeks is acceptable” or “minimal budget”	Cold site	Facility only, everything must be provisioned
“Elastic scaling during disaster”	Cloud DR (DRaaS)	Pay for what you use, scales on demand

Recovery Site Comparison

Attribute	Hot	Warm	Cold	Cloud DR
RTO	Minutes	Hours	Days-weeks	Minutes-hours
Cost	Highest	Moderate	Lowest	Variable (usage-based)
Data currency	Real-time	Last backup	Last backup	Configurable
Maintenance	Continuous	Periodic	Minimal	Provider-managed
Best for	Mission-critical (finance, healthcare)	Important but not real-time critical	Budget-constrained, low-priority systems	Organizations already in cloud

Recovery Metrics

RTO (Recovery Time Objective)

Maximum acceptable time to restore operations after a disruption.

“We must be back online within 4 hours.”
Drives decisions about recovery site type, backup strategy, and automation investment.

RPO (Recovery Point Objective)

Maximum acceptable amount of data loss measured in time.

“We can afford to lose no more than 1 hour of data.”
RPO of 1 hour → backups/replication must happen at least hourly.
RPO of 0 → requires real-time synchronous replication.
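After an incident or a recovery drill, both objectives can be checked against what actually happened. A minimal sketch: `assess_recovery` is a hypothetical helper and the timestamps are made-up example data.

```python
from datetime import datetime, timedelta


def assess_recovery(backup_times, failure_at, restored_at, rpo, rto):
    """Compare actual data loss and downtime against RPO and RTO."""
    usable = [t for t in backup_times if t <= failure_at]
    if not usable:
        raise ValueError("no backup predates the failure")
    data_loss = failure_at - max(usable)   # work done since the last good copy
    downtime = restored_at - failure_at    # how long operations were down
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }


if __name__ == "__main__":
    hourly = [datetime(2024, 1, 1, hour) for hour in range(0, 3)]
    result = assess_recovery(
        backup_times=hourly,
        failure_at=datetime(2024, 1, 1, 2, 45),
        restored_at=datetime(2024, 1, 1, 5, 45),
        rpo=timedelta(hours=1),
        rto=timedelta(hours=4),
    )
    print(result)
```

In the example the last backup finished 45 minutes before the failure and service returned three hours after it, so a 1-hour RPO and a 4-hour RTO are both met.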

MTBF (Mean Time Between Failures)

Average time a system operates before failing. Measure of reliability.

Higher MTBF = more reliable system
Used to predict failure frequency and plan maintenance

MTTR (Mean Time to Repair)

Average time to restore a system after failure. Measure of maintainability.

Lower MTTR = faster recovery
Improved by automation, documentation, spare parts availability
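The two metrics combine into the standard steady-state availability estimate, MTBF / (MTBF + MTTR). The notes above do not state this formula; it is added here as a commonly used approximation, and the function names are illustrative.

```python
HOURS_PER_YEAR = 8760   # 365 days; leap years ignored


def availability(mtbf_hours, mttr_hours):
    """Fraction of time the system is expected to be up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(mtbf_hours, mttr_hours):
    """Expected hours of downtime per year at that availability."""
    return (1 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Fails on average every 999 hours and takes 1 hour to repair.
    print(f"{availability(999, 1):.3%} available")
    print(f"{annual_downtime_hours(999, 1):.2f} hours down per year")
```

Halving MTTR improves availability about as much as doubling MTBF, which is why automation and spare parts are often the cheaper lever.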

Power Resilience

UPS (Uninterruptible Power Supply)

Battery backup that provides immediate power during outages.

Bridges the gap between power loss and generator startup (typically 10-30 seconds)
Also provides power conditioning (surge protection, voltage regulation)

Generator

Diesel or natural gas backup power for extended outages.

Requires fuel supply and regular testing
Automatic transfer switch (ATS) manages failover from utility to generator

PDU (Power Distribution Unit)

Distributes power to rack-mounted equipment. Managed PDUs allow remote power cycling and monitoring.

Capacity Planning

Ensuring infrastructure can handle current and future demand:

People: Sufficient staff for security operations, incident response, recovery
Technology: Processing capacity, storage, network bandwidth
Infrastructure: Data center space, power, cooling

Underprovisioning creates availability risk. Overprovisioning wastes resources. Capacity planning balances both against business requirements and growth projections.
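A growth projection turns "future demand" into a date. The sketch below assumes steady compound monthly growth, which is a simplification; `months_until_full` is a hypothetical helper and the numbers are example inputs.

```python
def months_until_full(used, capacity, monthly_growth_rate):
    """Months until usage reaches capacity at compound monthly growth.

    Returns 0 if already at or over capacity, and None if usage is not
    growing (capacity would never be reached).
    """
    if used >= capacity:
        return 0
    if used <= 0 or monthly_growth_rate <= 0:
        return None
    months = 0
    while used < capacity:
        used *= 1 + monthly_growth_rate
        months += 1
    return months


if __name__ == "__main__":
    # 50 TB used of 100 TB, growing 10% per month.
    print(months_until_full(50, 100, 0.10), "months of headroom")
```

Comparing that figure with procurement lead time shows whether the environment is heading toward an availability risk or simply carrying idle capacity.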

Offensive Context

Resilience is what prevents a successful attack from becoming a catastrophe. An attacker who detonates ransomware against an org with tested backups, a warm site, and a 4-hour RTO has caused an inconvenience. The same attack against an org with untested backups and no recovery site is an existential threat. Attackers increasingly target backup systems specifically (deleting shadow copies, encrypting backup servers, compromising cloud backup credentials) because they know destroying recovery capability maximizes leverage. Your backup architecture must assume the attacker will try to destroy it.