Reliability & Trusted Advisor
High availability concepts, fault tolerance, and AWS Trusted Advisor
Reliability & Availability Concepts
Reliability
Reliability is the probability that an entire system will function as intended for a specified period. It is commonly measured by MTBF (Mean Time Between Failures), where MTBF = MTTF + MTTR:
- MTTF: Mean Time To Failure — how long the system runs before failing.
- MTTR: Mean Time To Repair — how long it takes to diagnose and fix the failure.
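
The MTBF relationship above can be checked with a few lines of Python. This is a minimal sketch: the function name and the 999-hour and 1-hour figures are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def mtbf_hours(mttf_hours: float, mttr_hours: float) -> float:
    """Return MTBF using the definition above: MTBF = MTTF + MTTR."""
    if mttf_hours < 0 or mttr_hours < 0:
        raise ValueError("durations must be non-negative")
    return mttf_hours + mttr_hours


# Hypothetical figures: the system runs 999 hours before failing,
# and each repair takes 1 hour.
mttf = 999.0
mttr = 1.0
mtbf = mtbf_hours(mttf, mttr)

# Fraction of each failure-and-repair cycle spent running normally.
uptime_fraction = mttf / mtbf

print(f"MTBF: {mtbf} hours")
print(f"Uptime fraction: {uptime_fraction:.3f}")
```

With these illustrative numbers, one full failure-and-repair cycle lasts 1,000 hours, of which 999 are spent in normal operation.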
Availability
Availability = normal operation time / total time. It's expressed as a percentage of uptime over a period (commonly 1 year). The common shorthand is "number of 9s":
| Availability | Max Downtime Per Year | Example |
|---|---|---|
| 99% (two 9s) | ~3.65 days | Internal tools |
| 99.9% (three 9s) | ~8.76 hours | Business applications |
| 99.99% (four 9s) | ~52.56 minutes | Enterprise SaaS |
| 99.999% (five 9s) | ~5.26 minutes | Mission-critical systems |
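
The downtime column in the table follows directly from the availability percentage. The sketch below reproduces it, assuming a 365-day year (8,760 hours); the function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def max_downtime_hours(availability_percent: float) -> float:
    """Return the maximum yearly downtime, in hours, for a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for pct in (99.0, 99.9, 99.99, 99.999):
    hours = max_downtime_hours(pct)
    print(f"{pct}% -> {hours:.2f} hours ({hours * 60:.2f} minutes)")
```

For example, 99.9% availability leaves 0.1% of 8,760 hours, which is about 8.76 hours of downtime per year. Each additional 9 cuts the allowed downtime by a factor of ten.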
High Availability
A highly available system can withstand degradation while remaining available. Downtime is minimized, and minimal human intervention is needed. Services are restored rapidly, often in less than 1 minute.
Three Factors That Influence Availability
| Factor | Description |
|---|---|
| Fault Tolerance | Built-in redundancy of components. System remains operational even if some components fail. Relies on specialized hardware for instant failover. Does NOT address software failures (the most common cause of downtime). |
| Scalability | Ability to accommodate increases in capacity without changing design. Contributes to availability but doesn't guarantee it. |
| Recoverability | Policies and procedures related to restoring service after a catastrophic event. Ability to restore quickly with no data loss. |
AWS Trusted Advisor
AWS Trusted Advisor is an online tool that provides real-time guidance to help you provision resources following AWS best practices. It examines your entire AWS environment and gives recommendations in five categories:
| Category | What It Checks |
|---|---|
| Cost Optimization | Unused/idle resources; opportunities to commit to reserved capacity; potential monthly savings |
| Performance | Service limits; provisioned throughput utilization; overutilized instances |
| Security | IAM settings (MFA on root, password policy); security group rules with unrestricted access; S3 bucket permissions; enabling AWS security features |
| Fault Tolerance | Auto Scaling configuration; health checks; Multi-AZ deployments; backup capabilities (EBS snapshots, S3 bucket logging) |
| Service Limits | Usage exceeding 80% of the service limit (snapshot-based; changes can take up to 24 hours to reflect) |
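
The 80% rule in the Service Limits row is simple to express in code. The following is an illustrative sketch only, not Trusted Advisor's actual implementation: the function name is hypothetical, and treating exactly 80% as not yet flagged is an assumption based on the word "exceeding" above.

```python
def exceeds_limit_threshold(usage: int, limit: int, threshold: float = 0.80) -> bool:
    """Return True when usage is above the given fraction of the service limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Assumption: strictly greater than the threshold counts as "exceeding".
    return usage / limit > threshold


# Hypothetical account with a limit of 20 resources of some type.
print(exceeds_limit_threshold(17, 20))  # True: 85% of the limit
print(exceeds_limit_threshold(16, 20))  # False: exactly 80%
print(exceeds_limit_threshold(5, 20))   # False: 25% of the limit
```

Because the check is snapshot-based, a result like this reflects usage at the time of the last refresh, not necessarily the current state.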
Trusted Advisor Access by Support Plan
| Plan | Checks Available |
|---|---|
| Basic & Developer | 6 core checks (security and service limits) |
| Business & Enterprise | All checks (full suite across all 5 categories) |
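
Trusted Advisor checks can also be listed programmatically through the AWS Support API, which itself requires a Business-tier or higher support plan. The sketch below assumes the `boto3` SDK and configured credentials; `fetch_checks` and `group_checks_by_category` are hypothetical helper names, and the sample records (including their category strings) are illustrative rather than real API output.

```python
from collections import defaultdict


def group_checks_by_category(checks):
    """Group check names by category from a list of check records."""
    grouped = defaultdict(list)
    for check in checks:
        grouped[check["category"]].append(check["name"])
    return {category: sorted(names) for category, names in grouped.items()}


def fetch_checks():
    """Fetch check metadata from the AWS Support API (needs credentials)."""
    import boto3  # imported lazily so the grouping helper works without the SDK

    # The Support API is served from the us-east-1 endpoint.
    client = boto3.client("support", region_name="us-east-1")
    response = client.describe_trusted_advisor_checks(language="en")
    return response["checks"]


# Illustrative records shaped like the API's check entries (not real output).
sample_checks = [
    {"id": "example-1", "name": "MFA on Root Account", "category": "security"},
    {"id": "example-2", "name": "IAM Password Policy", "category": "security"},
    {"id": "example-3", "name": "Example Limit Check", "category": "service_limits"},
]

for category, names in group_checks_by_category(sample_checks).items():
    print(category, names)
```

Against a real account, passing the result of `fetch_checks()` to `group_checks_by_category` would show which categories your support plan exposes; verify the exact category strings against the actual response.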