Amazon Web Services Advanced

Architect on AWS: multi-AZ/Region HA, security, hybrid networking, data and messaging, caching and cost optimisation.

20 lessons · 60 quiz questions

📚 Lessons & quizzes

Each lesson ends with its own short quiz. Answer them as you go — score 90% across all lessons to earn your certificate.

1 The AWS Well-Architected Framework: six pillars

The Well-Architected Framework is a set of design principles and questions AWS uses to evaluate workloads. It is organised into six pillars, each describing a dimension of a healthy cloud architecture.

Operational Excellence — run and monitor systems, and continuously improve processes and procedures (infrastructure as code, small reversible changes, learn from failure).
Security — protect data, systems and assets through identity, detective controls, infrastructure protection and incident response.
Reliability — recover from failures, scale to meet demand, and mitigate disruptions automatically.
Performance Efficiency — use computing resources efficiently and keep doing so as demand and technology change.
Cost Optimisation — avoid unnecessary costs and pay only for what you need.
Sustainability — minimise the environmental impact of running cloud workloads (the sixth pillar, added in 2021).

The framework is supported by the Well-Architected Tool in the console, which lets you review a workload against the pillars and produce a prioritised list of improvements. It is a guidance and review process, not a certification gate.

2 Multi-AZ vs Multi-Region high availability

AWS organises its infrastructure into Regions (independent geographic areas) and Availability Zones (AZs) — one or more discrete data centres within a Region, each with redundant power, networking and connectivity, physically separated but linked by low-latency private fibre.

Multi-AZ deployment spreads a workload across two or more AZs in the same Region. It protects against the failure of a single data centre or AZ while keeping inter-AZ latency very low (single-digit milliseconds). This is the default recommendation for production high availability and is how Amazon RDS Multi-AZ, ELB and Auto Scaling groups operate.

Multi-Region deployment spreads a workload across separate Regions (e.g. eu-west-1 and us-east-1). It protects against a whole-Region outage and can place data closer to users, but it is more complex and costly: cross-Region replication has higher latency, and data residency and consistency must be managed carefully.

Rule of thumb: use Multi-AZ for availability within a Region, and add Multi-Region only when you need Region-level disaster recovery, very low latency for a global audience, or strict data-sovereignty separation.

3 Disaster recovery strategies and RTO/RPO

Disaster recovery (DR) planning balances cost against two objectives. RTO (Recovery Time Objective) is the maximum acceptable time to restore service after an incident. RPO (Recovery Point Objective) is the maximum acceptable amount of data loss, measured as a window of time before the incident.

AWS describes four DR strategies, from cheapest/slowest to most expensive/fastest:

Backup and restore — back up data (and infrastructure as code) to another Region; recreate on demand. Lowest cost, highest RTO (hours).
Pilot light — a minimal core (e.g. replicated database) runs in the recovery Region; the rest is provisioned when needed. Lower RTO than backup-restore.
Warm standby — a scaled-down but fully functional copy runs continuously and is scaled up on failover. Lower RTO again.
Multi-site active-active — full capacity serves traffic in multiple Regions simultaneously; failover is near-instant. Lowest RTO/RPO, highest cost.

Choose the cheapest strategy that still meets your business RTO and RPO targets — there is no benefit to over-investing in recovery speed your business does not require.
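
As a rough illustration of that selection rule, the sketch below picks the cheapest strategy that still meets a pair of targets. The RTO, RPO and cost figures are hypothetical planning numbers invented for this example, not AWS-published values, and `cheapest_strategy` is an illustrative helper name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_minutes: int    # assumed typical time to restore service
    rpo_minutes: int    # assumed typical data-loss window
    relative_cost: int  # 1 = cheapest, 4 = most expensive


# Hypothetical figures for illustration only; real values depend on the workload.
STRATEGIES = [
    Strategy("backup and restore", 24 * 60, 24 * 60, 1),
    Strategy("pilot light", 60, 15, 2),
    Strategy("warm standby", 15, 5, 3),
    Strategy("multi-site active-active", 1, 1, 4),
]


def cheapest_strategy(target_rto: int, target_rpo: int) -> Optional[Strategy]:
    """Return the lowest-cost strategy whose RTO and RPO both meet the targets."""
    candidates = [
        s for s in STRATEGIES
        if s.rto_minutes <= target_rto and s.rpo_minutes <= target_rpo
    ]
    return min(candidates, key=lambda s: s.relative_cost, default=None)
```

With these assumed numbers, a two-hour RTO and a 30-minute RPO select pilot light; tightening the targets moves the choice up the cost ladder, and a target nothing can meet returns `None`.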

4 Security best practices and defence in depth

Defence in depth means layering independent security controls so that the failure of any single control does not compromise the whole system. On AWS the layers typically include the network edge, the VPC, the host/instance, the application and the data itself.

Core best practices follow the Security pillar:

Least privilege — grant only the permissions an identity needs, and review them with tools such as IAM Access Analyzer.
No long-lived root or static keys — protect the root user with MFA, avoid using it, and prefer temporary credentials (roles) over access keys.
Encrypt everywhere — at rest (KMS) and in transit (TLS).
Enable detective controls — CloudTrail for an audit log of API calls, GuardDuty for threat detection, AWS Config for configuration compliance.
Automate response — use EventBridge and Lambda to react to security findings automatically.

Remember the shared responsibility model: AWS secures the cloud (hardware, the global infrastructure, managed-service software), while you secure what you put in the cloud (your data, identity configuration, OS patching on EC2, and network rules).

5 KMS and envelope encryption

AWS Key Management Service (KMS) creates and controls cryptographic keys and integrates with most AWS services for encryption at rest. A KMS key (formerly Customer Master Key) never leaves KMS unencrypted; you call KMS to use it, governed by key policies and IAM.

Encrypting large objects directly with a KMS key would be slow and would send the data to KMS; the KMS Encrypt operation also accepts only small payloads (up to 4 KB for symmetric keys). Instead AWS uses envelope encryption:

KMS generates a data key and returns it both in plaintext and encrypted under the KMS key.
Your client encrypts the actual data locally with the plaintext data key, then discards the plaintext key from memory.
The encrypted data key is stored next to the ciphertext.
To decrypt, you send the encrypted data key to KMS, which returns the plaintext data key, and you decrypt the data locally.

This keeps the master key inside KMS while allowing fast local bulk encryption. Customer managed keys support rotation, key policies and CloudTrail logging; AWS managed keys are simpler but less configurable. For compliance requiring hardware isolation, CloudHSM provides dedicated hardware security modules.

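# Generate a data key (envelope encryption).
# Returns Plaintext (use locally, then discard) and CiphertextBlob
# (store alongside the data); the CLI prints both base64-encoded.
aws kms generate-data-key \
  --key-id alias/my-app-key \
  --key-spec AES_256

# Later: decrypt the stored data key to recover the plaintext key.
# The file must hold the raw (base64-decoded) CiphertextBlob,
# because fileb:// reads binary content.
aws kms decrypt \
  --ciphertext-blob fileb://encrypted_data_key.bin \
  --key-id alias/my-app-key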

6 Network security: WAF, Shield, Security Groups and NACLs

VPC traffic is filtered by two mechanisms, one stateful and one stateless, that operate at different scopes:

Security Groups are stateful firewalls attached to elastic network interfaces (instances). Return traffic for an allowed request is automatically permitted. They support allow rules only.
Network ACLs (NACLs) are stateless filters attached to subnets. They evaluate rules in numbered order and support both allow and deny. Because they are stateless, you must explicitly allow both inbound and the corresponding outbound (ephemeral port) traffic.

At scale, NACLs are useful as a coarse subnet-wide deny (e.g. block a malicious CIDR), while Security Groups carry the fine-grained, instance-level policy and can reference each other by ID rather than by IP.
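
The ordered, stateless evaluation of a NACL can be sketched in a few lines. This is a simplified model, not AWS code: it ignores protocols and handles IPv4 only, and the rule numbers, CIDRs and the `nacl_allows` helper are made up for illustration.

```python
import ipaddress
from typing import NamedTuple


class NaclRule(NamedTuple):
    number: int
    cidr: str
    port_range: tuple  # (low, high), inclusive
    allow: bool


def nacl_allows(rules, remote_ip: str, port: int) -> bool:
    """Evaluate rules in ascending rule-number order; the first match decides."""
    ip = ipaddress.ip_address(remote_ip)
    for rule in sorted(rules, key=lambda r: r.number):
        low, high = rule.port_range
        if ip in ipaddress.ip_network(rule.cidr) and low <= port <= high:
            return rule.allow
    return False  # nothing matched: the implicit final rule denies


# Inbound: a low-numbered deny blocks one CIDR before the general HTTPS allow.
inbound = [
    NaclRule(90, "203.0.113.0/24", (0, 65535), allow=False),
    NaclRule(100, "0.0.0.0/0", (443, 443), allow=True),
]

# Outbound: replies leave on ephemeral ports and need their own allow rule,
# because a NACL does not remember that the inbound request was permitted.
outbound = [NaclRule(100, "0.0.0.0/0", (1024, 65535), allow=True)]
```

A Security Group would need only the inbound HTTPS rule, since return traffic is tracked automatically; it also could not express the deny in rule 90.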

At the edge: AWS WAF is a web application firewall that filters HTTP(S) requests (SQL injection, cross-site scripting, rate-based rules) on CloudFront, ALB or API Gateway. AWS Shield Standard provides automatic protection against common network/transport-layer DDoS attacks at no extra cost, while Shield Advanced adds enhanced detection, 24/7 response support and cost protection.

7 PrivateLink and private service access

By default, calls from a VPC to an AWS service API (like S3 or DynamoDB) or to a third-party SaaS go to public endpoints, which requires an internet gateway or NAT and can route traffic over the public internet. VPC endpoints keep that traffic on the AWS private network.

Gateway endpoints — route table entries for S3 and DynamoDB only. No extra cost, no ENI; they add a prefix-list route so traffic stays internal.
Interface endpoints (AWS PrivateLink) — an elastic network interface with a private IP placed in your subnet, fronting a service. Used for most other AWS services and for exposing your own services privately.

AWS PrivateLink lets a provider publish a service behind a Network Load Balancer as an endpoint service; consumers create an interface endpoint to reach it using private IPs, without VPC peering, internet gateways, NAT or overlapping-CIDR concerns. Traffic never traverses the public internet, which improves security and simplifies network design for multi-account or partner integrations.

8 Hybrid connectivity: Direct Connect, VPN and Transit Gateway

Connecting on-premises networks to AWS uses two main options, often combined:

Site-to-Site VPN — an IPsec tunnel over the public internet between your customer gateway and an AWS virtual private gateway or Transit Gateway. Quick to set up, encrypted, but subject to internet variability.
AWS Direct Connect (DX) — a dedicated private physical connection from your network into an AWS Direct Connect location. It offers consistent low latency and higher, more predictable bandwidth, and it bypasses the public internet. It is not encrypted by itself, so a VPN can run over it for encryption.

A common resilient pattern is Direct Connect as primary with a Site-to-Site VPN as backup.

As the number of VPCs grows, full-mesh VPC peering becomes unmanageable. AWS Transit Gateway acts as a regional cloud router (hub-and-spoke): VPCs, VPNs and Direct Connect gateways attach to it once, and routing between them is centralised. It scales to thousands of attachments and supports route tables for segmentation.
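
The arithmetic behind "unmanageable" is easy to check. The sketch below compares the number of peering connections a full mesh needs with the number of Transit Gateway attachments for the same VPC count; the function names are illustrative.

```python
def full_mesh_peerings(vpcs: int) -> int:
    """Peering connections needed so every VPC reaches every other one directly."""
    return vpcs * (vpcs - 1) // 2


def transit_gateway_attachments(vpcs: int) -> int:
    """With a hub-and-spoke Transit Gateway, each VPC attaches once."""
    return vpcs


for n in (5, 10, 50):
    print(n, full_mesh_peerings(n), transit_gateway_attachments(n))
```

The mesh grows quadratically while the hub grows linearly, which is why the hub-and-spoke model wins as the estate expands.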

9 Amazon Redshift and Athena for analytics

For analytics at scale, AWS separates the data-warehouse and query-in-place patterns.

Amazon Redshift is a fully managed, petabyte-scale data warehouse. It uses columnar storage, data compression and massively parallel processing (MPP) across compute nodes to run complex analytical SQL quickly. You load and structure data into Redshift; Redshift Spectrum additionally lets you query data sitting in S3 without loading it. Distribution keys and sort keys are tuned to minimise data movement between nodes.

Amazon Athena is a serverless interactive query service that runs standard SQL directly over data in S3 using a schema defined in the AWS Glue Data Catalog. There are no servers to manage and you pay per amount of data scanned, so partitioning and columnar formats like Parquet dramatically cut cost.

Rule of thumb: use Athena for ad-hoc queries over data already in S3 with no infrastructure to run; use Redshift for sustained, high-concurrency BI workloads and complex joins where a dedicated, tuned warehouse pays off.
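
To see why partitioning and columnar formats matter for a pay-per-scan service, here is a back-of-the-envelope estimate. The price per terabyte is an assumed figure (check current pricing for your Region), the dataset sizes are hypothetical, and the model ignores compression and any per-query minimum charge.

```python
PRICE_PER_TB_USD = 5.00  # assumption for illustration, not a quoted price
TB = 10 ** 12            # decimal terabyte, to keep the arithmetic simple


def scan_cost_usd(bytes_scanned: int) -> float:
    """Cost of one query that scans the given number of bytes."""
    return bytes_scanned / TB * PRICE_PER_TB_USD


full_scan = 2 * TB                           # hypothetical unpartitioned dataset
one_partition = full_scan // 365             # daily partitions, query touches one day
two_of_twenty_columns = one_partition // 10  # columnar file, 2 of 20 equal-width columns read

for label, size in [
    ("full scan", full_scan),
    ("one daily partition", one_partition),
    ("partition + columnar", two_of_twenty_columns),
]:
    print(f"{label}: ${scan_cost_usd(size):.4f}")
```

Under these assumptions the same question costs dollars against the raw data and a fraction of a cent once the data is partitioned and stored in a columnar format.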

10 Event-driven architecture: EventBridge and Step Functions

Event-driven designs decouple producers from consumers so components scale and fail independently.

Amazon EventBridge is a serverless event bus. Producers publish events; rules match events by pattern and route them to targets (Lambda, SQS, Step Functions, other accounts). It supports a default bus, custom buses, partner SaaS event sources and a schema registry. Unlike a queue, one event can fan out to many targets via multiple matching rules.

AWS Step Functions orchestrates multi-step workflows as a state machine defined in Amazon States Language. It manages sequencing, branching (Choice), parallelism (Parallel/Map), retries, error handling and waits, replacing brittle glue code that chains Lambdas together. Standard workflows are durable and long-running (up to a year) and are billed per state transition; Express workflows are high-volume and short-lived (up to five minutes) and are billed by number of executions, duration and memory used.

Together, EventBridge routes events and Step Functions coordinates the resulting business processes reliably.
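
The fan-out behaviour of rules can be modelled with a small matcher. This sketch covers only exact-value matching (a field matches if its value appears in the pattern's list) and nested fields; real event patterns support more operators. The rule names and the event are hypothetical.

```python
def matches(pattern: dict, event: dict) -> bool:
    """True if every field in the pattern is present in the event with an accepted value."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the matching sub-object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Hypothetical rules on one bus; each would point at its own target.
RULES = {
    "bill-customer": {"source": ["shop.orders"], "detail-type": ["OrderPlaced"]},
    "audit-everything": {"source": ["shop.orders"]},
    "notify-vip": {"source": ["shop.orders"], "detail": {"tier": ["gold", "platinum"]}},
}


def targets_for(event: dict) -> list:
    """Names of every rule whose pattern matches the event."""
    return sorted(name for name, pattern in RULES.items() if matches(pattern, event))


order = {
    "source": "shop.orders",
    "detail-type": "OrderPlaced",
    "detail": {"tier": "gold", "total": 120},
}
print(targets_for(order))
```

One published event reaches three independent consumers here, and the producer knows about none of them.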

11 ElastiCache: Redis and Memcached

Amazon ElastiCache provides managed in-memory caching to offload databases and serve microsecond-latency reads. It offers two engines.

Memcached — a simple, multi-threaded key-value cache. It scales horizontally by adding nodes (sharding), but has no persistence, no replication and no advanced data structures. Best for simple, ephemeral caching where losing the cache is harmless.
Redis (ElastiCache for Redis / Valkey) — supports rich data structures (lists, sets, sorted sets, hashes), replication and Multi-AZ failover, optional persistence, pub/sub, and cluster mode for sharding. Best when you need high availability, durability or advanced features.

Two common caching strategies: lazy loading (cache-aside) populates the cache only on a miss — simple, but the first request after expiry is slow and data can be stale; write-through updates the cache on every write — data is fresh but writes are heavier and unused data may be cached. A TTL bounds staleness in either case.
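
Both strategies fit in a short sketch. A plain dictionary stands in for the database and a small in-process class stands in for the cache; in practice the cache would be an ElastiCache cluster reached through a client library. All names here are illustrative.

```python
import time


class TtlCache:
    """Minimal in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None or entry[1] <= self._clock():
            self._items.pop(key, None)  # drop missing or expired entries
            return None
        return entry[0]

    def set(self, key, value):
        self._items[key] = (value, self._clock() + self._ttl)


def read_lazy(cache, db, key):
    """Lazy loading (cache-aside): fill the cache only after a miss."""
    value = cache.get(key)
    if value is None:
        value = db[key]  # slow path: fetch from the database
        cache.set(key, value)
    return value


def write_through(cache, db, key, value):
    """Write-through: update the database and the cache on every write."""
    db[key] = value
    cache.set(key, value)
```

With lazy loading alone, a value changed directly in the database stays stale in the cache until the TTL expires; routing writes through `write_through` keeps the cached copy fresh at the cost of extra work on each write.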

12 API Gateway patterns

Amazon API Gateway is a managed front door for APIs, handling routing, throttling, authorisation, caching and request/response transformation. It supports three API types:

REST APIs — the full-featured option: request validation, mapping templates, usage plans and API keys, WAF integration and edge/regional/private endpoints.
HTTP APIs — a lower-cost, lower-latency option with fewer features, ideal for simple proxying to Lambda or HTTP backends.
WebSocket APIs — for bidirectional, real-time communication.

Key patterns include Lambda proxy integration (pass the whole request to a function), authorisers (Cognito user pools or a Lambda authoriser to validate tokens before the backend is invoked), usage plans and throttling to protect downstreams and enforce quotas, and caching responses at the gateway to cut backend load. For private APIs, deploy a private endpoint reachable only through an interface VPC endpoint.
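
Throttling with a steady rate plus a burst allowance is commonly modelled as a token bucket. The sketch below is a generic token bucket, not API Gateway's implementation; the rate and burst values are hypothetical, and time is passed in explicitly so the behaviour is easy to follow.

```python
class TokenBucket:
    """Allow `burst` requests at once, refilled at `rate_per_second` tokens."""

    def __init__(self, rate_per_second: float, burst: int):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # a gateway would reject this request as throttled


bucket = TokenBucket(rate_per_second=2, burst=3)
print([bucket.allow(0.0) for _ in range(4)])  # the burst is spent, then throttling starts
```

The burst absorbs short spikes, while the refill rate bounds the sustained load that reaches the backend.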

13 Microservices on ECS and EKS

AWS offers two main container orchestrators for microservices.

Amazon ECS (Elastic Container Service) — the AWS-native orchestrator. Simpler to operate and deeply integrated with AWS, defined through task definitions and services.
Amazon EKS (Elastic Kubernetes Service) — managed Kubernetes. Choose it when you want the Kubernetes ecosystem, portability across clouds, or existing Kubernetes skills and tooling.

Both can run on two compute models. With EC2 launch type / node groups you manage the underlying instances (and can use Spot, GPUs, specialised sizing). With AWS Fargate you run containers serverlessly — no servers to provision, patch or scale; you pay for the vCPU and memory your tasks request. Fargate reduces operational overhead at a per-task cost premium.

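For service-to-service traffic, AWS App Mesh or Kubernetes-native meshes add observability, retries and mTLS, while load balancers (ALB/NLB) and service discovery route external and internal requests. AWS has announced end of support for App Mesh on 30 September 2026, so check the currently recommended alternatives before adopting it in a new design.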

14 Observability at scale: CloudWatch and X-Ray

Observability combines metrics, logs and traces so you can understand system behaviour.

Amazon CloudWatch Metrics — numeric time-series (CPU, latency, custom application metrics). Alarms trigger actions or notifications when a metric breaches a threshold.
CloudWatch Logs — centralised log storage; Logs Insights queries logs at scale, and metric filters turn log patterns into metrics.
CloudWatch Alarms and Dashboards — visualise and alert; composite alarms reduce noise by combining conditions.

AWS X-Ray provides distributed tracing: it follows a request across microservices, building a service map and showing where latency and errors occur. This is essential when a single user request fans out across many services and a CPU metric alone cannot tell you which hop is slow.

At scale, structure logs as JSON, emit custom metrics with embedded metric format, sample traces to control cost, and aggregate across accounts using a central monitoring account.
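
A structured log line that also carries a metric might look like the sketch below. It builds a JSON record in the embedded metric format, with the metric definition under the `_aws` key and the value as a top-level field. The namespace, dimension and metric names are hypothetical, and `emf_record` is an illustrative helper rather than part of any AWS library.

```python
import json
import time


def emf_record(namespace, service, metric_name, value,
               unit="Milliseconds", timestamp_ms=None):
    """Return one JSON log line that declares and carries a single metric."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    return json.dumps({
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [["Service"]],  # names of fields used as dimensions
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        "Service": service,   # dimension value
        metric_name: value,   # metric value
    })


print(emf_record("Shop", "checkout", "Latency", 123))
```

Writing this line to standard output from a Lambda function, or shipping it through the CloudWatch agent, lets the same record serve as both a searchable log entry and a metric data point.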

15 Governance: Organizations, SCPs, Config and Control Tower

Multi-account governance centralises control while isolating workloads.

AWS Organizations — groups accounts under a management account into organisational units (OUs), enables consolidated billing, and is the foundation for org-wide policies.
Service Control Policies (SCPs) — organisation-level guardrails that set the maximum permissions available to accounts/OUs. An SCP never grants access; it bounds what IAM in member accounts can allow. For example, an SCP can deny use of unapproved Regions across every account.
AWS Config — records resource configurations over time and evaluates them against rules for compliance (e.g. "no public S3 buckets"), with automated remediation.
AWS Control Tower — sets up and governs a secure multi-account landing zone using Organizations, SCP-based guardrails, centralised logging and a dashboard, applying best practices automatically.

Key mental model: SCPs limit, IAM grants. The effective permission is the intersection of the two.
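
That intersection can be expressed as a tiny evaluator. It is a deliberately simplified model: it looks at action names only and ignores resources, conditions, IAM explicit denies and permission boundaries. The policies shown are hypothetical.

```python
from fnmatch import fnmatchcase


def _matches(action: str, patterns) -> bool:
    """True if the action matches any pattern such as 's3:*' or '*'."""
    return any(fnmatchcase(action, p) for p in patterns)


def is_allowed(action, scp_allow, scp_deny, iam_allow) -> bool:
    """An action works only if the SCP permits it AND IAM grants it."""
    if _matches(action, scp_deny):
        return False  # an explicit deny in the guardrail always wins
    return _matches(action, scp_allow) and _matches(action, iam_allow)


# Hypothetical guardrail: allow everything except leaving the organisation.
scp_allow = ["*"]
scp_deny = ["organizations:LeaveOrganization"]
# Hypothetical IAM policy in a member account.
iam_allow = ["s3:*", "organizations:*"]

for action in ("s3:GetObject", "organizations:LeaveOrganization", "ec2:RunInstances"):
    print(action, is_allowed(action, scp_allow, scp_deny, iam_allow))
```

The second action fails because the SCP forbids it even though IAM grants it, and the third fails because the SCP permits it but IAM never grants it.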

16 Cost optimisation: Savings Plans, Reserved Instances, Spot and rightsizing

EC2 and related compute can be purchased under several pricing models, traded off between flexibility and discount.

On-Demand — pay per second/hour, no commitment, highest price. Good for unpredictable or short workloads.
Reserved Instances (RIs) — commit to a specific instance type/family in a Region for 1 or 3 years for a large discount. Standard RIs give the largest discount; Convertible RIs allow changing attributes.
Savings Plans — commit to a steady amount of compute spend ($/hour) for 1 or 3 years. Compute Savings Plans are the most flexible (apply across instance family, Region, and even Fargate/Lambda); EC2 Instance Savings Plans give a deeper discount but are tied to a family in a Region.
Spot Instances — use spare EC2 capacity at up to ~90% off On-Demand prices, but AWS can reclaim them with a two-minute warning. There is no bidding: you pay the current Spot price and can optionally set a maximum price. Ideal for fault-tolerant, interruptible work (batch, CI, stateless workers).

Rightsizing — continuously match instance size to actual utilisation using Compute Optimizer and Cost Explorer recommendations. A common strategy mixes Savings Plans/RIs for the steady baseline and Spot for the elastic, interruptible layer.
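
A quick calculation shows why the mixed strategy is attractive. Every rate and discount below is a hypothetical round number chosen for easy arithmetic, not an AWS price, and the model assumes Spot capacity is always available.

```python
ON_DEMAND_RATE = 0.10         # USD per instance-hour (hypothetical)
SAVINGS_PLAN_DISCOUNT = 0.30  # fraction off On-Demand (hypothetical)
SPOT_DISCOUNT = 0.70          # fraction off On-Demand (hypothetical)


def hourly_cost(baseline: int, burst: int, optimised: bool) -> float:
    """Cost of running `baseline` steady instances plus `burst` elastic ones."""
    if not optimised:
        return (baseline + burst) * ON_DEMAND_RATE
    committed = baseline * ON_DEMAND_RATE * (1 - SAVINGS_PLAN_DISCOUNT)
    spot = burst * ON_DEMAND_RATE * (1 - SPOT_DISCOUNT)
    return committed + spot


print(f"all On-Demand:       ${hourly_cost(10, 20, optimised=False):.2f}/hour")
print(f"Savings Plan + Spot: ${hourly_cost(10, 20, optimised=True):.2f}/hour")
```

The commitment covers only the baseline that is always running, so there is no risk of paying for unused commitment when the elastic layer shrinks.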

17 IAM federation and IAM Identity Center (SSO)

Long-lived IAM users with passwords and access keys do not scale across many accounts. Federation lets users sign in with an existing corporate identity and assume IAM roles, receiving temporary credentials from AWS STS instead of static keys.

SAML 2.0 / OIDC federation — an external identity provider (Active Directory, Okta, Google) asserts identity; users assume a role mapped to their group.
AWS IAM Identity Center (formerly AWS SSO) — the recommended way to manage workforce access across an AWS Organization. It provides a single sign-on portal, connects to an external IdP or its own directory, and assigns permission sets (collections of policies) to users/groups per account.

The benefits are central user lifecycle management, no per-account IAM users, and short-lived credentials. For applications (not people) that need AWS identities, use IAM roles attached to compute (EC2 instance profiles, Lambda execution roles, IRSA for EKS) so no static keys are embedded in code.

18 Auto Scaling architectures

EC2 Auto Scaling keeps the right number of instances running to meet demand. An Auto Scaling group (ASG) spans multiple Availability Zones, launches instances from a launch template, and maintains a desired/min/max count.

Scaling policies:

Target tracking — keep a metric (e.g. average CPU at 50%) at a target; AWS adds/removes capacity automatically. The simplest and most common.
Step scaling — add/remove specific amounts based on alarm thresholds.
Scheduled scaling — change capacity at known times (e.g. business hours).
Predictive scaling — uses machine learning on historical patterns to provision ahead of forecast demand.

Health checks (EC2 and ELB) replace unhealthy instances automatically, giving self-healing. ASGs integrate with a load balancer so new instances are registered as targets. Combine with a mixed instances policy across On-Demand and Spot to cut cost while keeping a stable baseline. Auto Scaling also exists for ECS, DynamoDB and Aurora replicas — the same principle of matching capacity to demand.
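
The idea behind target tracking can be approximated with a proportional calculation: if the metric is 50% above target, roughly 50% more capacity is needed. The sketch below is a simplified model under that assumption; the real service also applies warm-up periods and scales in more cautiously than it scales out. `desired_capacity` is an illustrative name.

```python
import math


def desired_capacity(current: int, metric: float, target: float,
                     minimum: int, maximum: int) -> int:
    """Scale capacity in proportion to metric/target, clamped to the group's bounds."""
    proposed = math.ceil(current * metric / target)
    return max(minimum, min(maximum, proposed))


# Four instances at 75% average CPU with a 50% target.
print(desired_capacity(current=4, metric=75, target=50, minimum=2, maximum=10))
```

The min and max bounds matter as much as the target: the minimum keeps a safe floor during quiet periods, and the maximum caps spend if the metric misbehaves.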

19 Serverless patterns: Lambda at scale

AWS Lambda runs code without managing servers, scaling automatically by running more concurrent execution environments as requests arrive. At scale, several concepts matter.

Concurrency — each simultaneous invocation uses one execution environment. Reserved concurrency caps (and guarantees) a function’s share of the account limit; provisioned concurrency keeps environments pre-warmed to avoid cold starts for latency-sensitive paths.
Cold starts — the first invocation in a new environment pays initialisation time. Minimise by trimming package size, using lighter runtimes, and provisioned concurrency where needed.
Event sources — Lambda integrates with API Gateway (sync), SQS/Kinesis/DynamoDB Streams (poll-based), S3 and EventBridge (async). Async invocations retry and can route failures to a dead-letter queue or on-failure destination.
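
Concurrency needs can be estimated from traffic and duration: concurrent executions are roughly requests per second multiplied by average duration in seconds. The sketch below applies that estimate; the account limit is an assumed figure (check the quota for your own account and Region) and the function names are hypothetical.

```python
import math

ACCOUNT_CONCURRENCY_LIMIT = 1000  # assumption for illustration


def required_concurrency(requests_per_second: int, avg_duration_ms: int) -> int:
    """Estimated simultaneous execution environments for a steady load."""
    return math.ceil(requests_per_second * avg_duration_ms / 1000)


def unreserved_headroom(reserved_by_function: dict) -> int:
    """Concurrency left in the shared pool after all reservations."""
    return ACCOUNT_CONCURRENCY_LIMIT - sum(reserved_by_function.values())


print(required_concurrency(100, 500))  # 100 req/s at 500 ms each
print(unreserved_headroom({"checkout": 200, "thumbnails": 100}))
```

Halving a function's duration halves the concurrency it consumes, which is often a cheaper fix than raising limits.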

Design functions to be stateless and idempotent (the same event may be delivered more than once), keep them small and single-purpose, and push shared state to DynamoDB, S3 or ElastiCache. Watch the account-level concurrency limit so a spike in one function does not starve others.
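
Idempotency usually comes down to remembering which events have already been handled. In this sketch a dictionary stands in for a durable store such as a DynamoDB table, the event shape (`id`, `amount`) is hypothetical, and a list records the side effect so that duplicates are easy to spot.

```python
charges = []    # side effects that must not be repeated
processed = {}  # stand-in for a durable store keyed by event id


def handler(event, context=None):
    """Charge once per event id, however many times the event is delivered."""
    key = event["id"]  # assumes each logical event carries a unique id
    if key in processed:
        return processed[key]  # duplicate delivery: return the stored result

    charges.append(event["amount"])  # the side effect happens once
    result = {"status": "charged", "amount": event["amount"]}
    processed[key] = result
    return result
```

In a real function the check and the write would need to be a single atomic operation, for example a conditional write to the store, so that two concurrent deliveries of the same event cannot both pass the check.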

20 Global routing: Route 53, CloudFront and Global Accelerator

Delivering a fast, resilient global experience combines DNS, a CDN and the AWS backbone.

Amazon Route 53 — managed DNS with health checks and routing policies: latency-based (send users to the lowest-latency Region), geolocation/geoproximity, weighted (split traffic, e.g. for canaries), and failover (route away from an unhealthy endpoint). DNS changes propagate within the TTL.
Amazon CloudFront — a content delivery network that caches content at edge locations close to users, reducing latency and offloading origins. It terminates TLS at the edge, integrates with WAF, and can run Lambda@Edge / CloudFront Functions for request manipulation.
AWS Global Accelerator — provides two static anycast IP addresses and routes traffic over the AWS global network backbone to the nearest healthy endpoint. Unlike CloudFront it does not cache; it improves performance and availability for TCP and UDP workloads, including non-HTTP ones, and gives near-instant Regional failover.

A typical pattern: Route 53 for DNS, CloudFront for cacheable web content, and Global Accelerator for latency-sensitive or non-cacheable global traffic.
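
Latency-based routing combined with health checks reduces to a simple decision: among the healthy Regions, choose the one with the lowest latency for this user. The sketch below models only that decision, not DNS itself; the latency figures and the `pick_region` helper are hypothetical.

```python
def pick_region(latency_ms: dict, healthy: dict):
    """Return the lowest-latency Region that is currently healthy, or None."""
    candidates = {
        region: ms for region, ms in latency_ms.items()
        if healthy.get(region, False)
    }
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


# Hypothetical measurements for a user in western Europe.
latency = {"eu-west-1": 20, "us-east-1": 90, "ap-southeast-1": 180}

print(pick_region(latency, {"eu-west-1": True, "us-east-1": True, "ap-southeast-1": True}))
print(pick_region(latency, {"eu-west-1": False, "us-east-1": True, "ap-southeast-1": True}))
```

When the nearest Region fails its health check, the user is sent to the next-best Region rather than to an error page, at the cost of higher latency.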
