Lesson 40

Reliability & Trusted Advisor

High availability concepts, fault tolerance, and AWS Trusted Advisor

Reliability & Availability Concepts

Reliability

Reliability is the probability that an entire system will function as intended for a specified period. It is commonly measured with Mean Time Between Failures: MTBF = MTTF + MTTR, where:

  • MTTF (Mean Time To Failure): the average time the system runs before it fails.
  • MTTR (Mean Time To Repair): the average time it takes to diagnose and fix a failure.
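As a minimal sketch of the arithmetic, the Python function below adds MTTF and MTTR to get MTBF. The function name and the sample figures (720 hours of operation, 2 hours of repair) are hypothetical and chosen only for illustration.

```python
def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Return Mean Time Between Failures: MTBF = MTTF + MTTR (all in hours)."""
    if mttf_hours < 0 or mttr_hours < 0:
        raise ValueError("MTTF and MTTR must be non-negative")
    return mttf_hours + mttr_hours


# Hypothetical system: runs about 720 hours before failing,
# then needs 2 hours to diagnose and repair.
print(mtbf(720, 2))  # 722
```

Each failure cycle in this example lasts 722 hours, of which 720 are spent operating.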

Availability

Availability is the proportion of time a system is in normal operation: availability = normal operation time / total time. It is expressed as a percentage of uptime over a period (commonly one year), usually in the shorthand "number of 9s":

Availability | Max Downtime Per Year | Example
99% (two 9s) | ~3.65 days | Internal tools
99.9% (three 9s) | ~8.76 hours | Business applications
99.99% (four 9s) | ~52.56 minutes | Enterprise SaaS
99.999% (five 9s) | ~5.26 minutes | Mission-critical systems
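The downtime column follows directly from the definition: allowed downtime = (1 - availability) × length of the period. A minimal Python sketch, assuming a 365-day year (8,760 hours) and ignoring leap years; the function names are hypothetical:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def availability_pct(uptime_hours: float, total_hours: float) -> float:
    """Availability = normal operation time / total time, as a percentage."""
    return 100 * uptime_hours / total_hours


def max_downtime_hours(target_pct: float) -> float:
    """Largest yearly downtime (in hours) that still meets an availability target."""
    return HOURS_PER_YEAR * (1 - target_pct / 100)


# Reproduce the table rows: hours, days, and minutes of allowed downtime.
for target in (99.0, 99.9, 99.99, 99.999):
    hours = max_downtime_hours(target)
    print(f"{target}%: {hours:.2f} h ({hours / 24:.2f} days, {hours * 60:.2f} min)")
```

Each extra 9 cuts the yearly downtime budget by a factor of ten, which is why the jump from three to five 9s is so demanding.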

High Availability

A highly available system can withstand degradation while remaining available. Downtime is minimized, and minimal human intervention is needed. Services are restored rapidly, often in less than 1 minute.

Three Factors That Influence Availability

Factor | Description
Fault Tolerance | Built-in redundancy of components. System remains operational even if some components fail. Relies on specialized hardware for instant failover. Does NOT address software failures (the most common cause of downtime).
Scalability | Ability to accommodate increases in capacity without changing design. Contributes to availability but doesn't guarantee it.
Recoverability | Policies and procedures related to restoring service after a catastrophic event. Ability to restore quickly with no data loss.
Cost tradeoff: Improving availability usually increases cost, so balance the cost of each improvement against its benefit to users. Decide whether the goal is for the system to be always reachable or to service requests within an acceptable level of performance.
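Redundancy is what lets fault tolerance raise availability. A common back-of-the-envelope model, assumed here rather than taken from the source, treats N replicas as independent: the system is down only when every replica is down at the same time, so A_system = 1 - (1 - A)^N. The sketch below applies that model; the function name and the 99% per-replica figure are hypothetical.

```python
def parallel_availability(replica_availability: float, replicas: int) -> float:
    """Availability of N redundant replicas, any one of which can serve requests.

    Assumes replicas fail independently and failover is instantaneous.
    The system is unavailable only if all replicas are down at once.
    """
    if not 0.0 <= replica_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("need at least one replica")
    return 1 - (1 - replica_availability) ** replicas


# Hypothetical replicas that are each 99% available.
for n in (1, 2, 3):
    print(f"{n} replica(s): {parallel_availability(0.99, n):.6%}")
```

Under these assumptions, two 99% replicas reach about 99.99% and three reach about 99.9999%, but every replica adds its own cost, which is the tradeoff described above. The independence assumption is also exactly what a software bug breaks: if the same faulty release runs on every replica, they fail together, which is why redundancy alone does not address software failures.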

AWS Trusted Advisor

AWS Trusted Advisor is an online tool that provides real-time guidance to help you provision resources following AWS best practices. It examines your entire AWS environment and gives recommendations in five categories:

Category | What It Checks
Cost Optimization | Unused/idle resources; opportunities to commit to reserved capacity; potential monthly savings
Performance | Service limits; provisioned throughput utilization; overutilized instances
Security | IAM settings (MFA on root, password policy); security group rules with unrestricted access; S3 bucket permissions; enabling AWS security features
Fault Tolerance | Auto Scaling configuration; health checks; Multi-AZ deployments; backup capabilities (EBS snapshots, S3 bucket logging)
Service Limits | Usage exceeding 80% of the service limit (snapshot-based; changes can take up to 24 hours to reflect)
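To make the Service Limits rule concrete, here is a simplified model of the 80% threshold. It is not Trusted Advisor's implementation: the resource names, usage figures, and limits are hypothetical, and the real check works from a periodic snapshot rather than live values.

```python
from dataclasses import dataclass

USAGE_THRESHOLD = 0.80  # flag usage that exceeds 80% of the limit


@dataclass
class QuotaUsage:
    name: str   # hypothetical resource label
    usage: int  # current usage (in practice, a snapshot value)
    limit: int  # service limit (quota) for the resource


def near_limit(items: list[QuotaUsage], threshold: float = USAGE_THRESHOLD) -> list[str]:
    """Return names of resources whose usage exceeds threshold * limit."""
    return [q.name for q in items if q.limit > 0 and q.usage / q.limit > threshold]


# Hypothetical snapshot of three resources.
snapshot = [
    QuotaUsage("resource-a", usage=45, limit=50),  # 90%: flagged
    QuotaUsage("resource-b", usage=4, limit=5),    # exactly 80%: not flagged
    QuotaUsage("resource-c", usage=1, limit=5),    # 20%: fine
]
print(near_limit(snapshot))  # ['resource-a']
```

The sketch uses a strict "greater than" comparison to match the "exceeding 80%" wording; how the real check treats a value of exactly 80% is not covered here.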

Trusted Advisor Access by Support Plan

Plan | Checks Available
Basic & Developer | Core checks only (six security checks plus the service limits checks)
Business & Enterprise | All checks (full suite across all 5 categories)
Exam Tip: "Which AWS service provides security/performance/cost/fault tolerance optimization recommendations?" → AWS Trusted Advisor. "Provide" + "recommendations" = Trusted Advisor. It does NOT make changes — it only recommends.
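Because Trusted Advisor only recommends, its checks can also be read programmatically without changing any resources. The sketch below uses the boto3 client for the AWS Support API and assumes configured AWS credentials and a Support plan that includes API access (AWS documents this for Business and Enterprise-level plans). The helper names are hypothetical, and the us-east-1 endpoint is an assumption to confirm against current AWS documentation.

```python
from collections import defaultdict


def group_checks_by_category(checks: list[dict]) -> dict[str, list[str]]:
    """Group check names by their "category" field (e.g. "security")."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for check in checks:
        grouped[check["category"]].append(check["name"])
    return dict(grouped)


def fetch_trusted_advisor_checks() -> list[dict]:
    """Read-only call that lists Trusted Advisor check descriptions.

    Requires boto3, AWS credentials, and a Support plan with AWS Support API access.
    """
    import boto3  # imported here so the grouping helper works without boto3 installed

    # Assumption: the AWS Support API is served from us-east-1.
    client = boto3.client("support", region_name="us-east-1")
    response = client.describe_trusted_advisor_checks(language="en")
    return response["checks"]
```

With credentials in place, `group_checks_by_category(fetch_trusted_advisor_checks())` returns a dictionary keyed by category. Fetching a single check's result (for example with `describe_trusted_advisor_check_result`) is likewise read-only; acting on a recommendation is still up to you.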

Reliability & Trusted Advisor Quiz


1. How is Mean Time Between Failures (MTBF) calculated?
2. A system has 99.999% availability. What is the maximum downtime this system can experience per year?
3. Which factor influencing availability refers to the built-in redundancy of components and the ability to remain operational even if some components fail?
4. Which AWS tool would alert you that MFA is not enabled on your root account?
5. Which Trusted Advisor category checks whether you have unused EC2 instances or idle load balancers?
Primary Source: AWS Academy Module 9: Cloud Architecture (module-9.txt) — Sections 2-3: Reliability and High Availability, AWS Trusted Advisor.
Last updated: June 2026. © 2026 Shahriar Ahmed Shovon.