
Google Cloud Platform Advanced

Architect on Google Cloud: HA and multi-region, security, hybrid networking, BigQuery and data, Pub/Sub and cost control.

22 lessons 66 quiz questions

📚 Lessons & quizzes

Each lesson ends with its own short quiz. Answer them as you go — score 90% across all lessons to earn your certificate.

1 The Google Cloud Architecture Framework

The Google Cloud Architecture Framework is Google’s set of best-practice guidance for designing and operating workloads on Google Cloud. It is organised around six core principles (pillars):

  • Operational excellence — efficiently deploy, operate, monitor and manage workloads.
  • Security, privacy and compliance — maximise the security of data and workloads, design for compliance.
  • Reliability — design resilient, highly available systems and plan for recovery.
  • Cost optimisation — maximise business value for what you spend.
  • Performance optimisation — design and tune resources for performance.
  • Sustainability — reduce the environmental impact of your cloud footprint.

The framework complements but is distinct from the Cloud Adoption Framework (organisational maturity) and the Well-Architected review process. Architects use it as a checklist when reviewing a design: each decision should be traceable to one or more pillars, and trade-offs (for example cost versus reliability) should be explicit rather than accidental.

A healthy architecture treats these pillars as a system: increasing reliability with multi-region replication raises cost, so the framework pushes you to align the level of investment with the workload’s actual business requirements rather than over-engineering everything.

2 High availability with regional and multi-zonal resources

A Google Cloud region is a specific geographic location made up of three or more isolated zones. A zone is a deployment area within a region with its own power, cooling and networking, engineered so that a failure in one zone does not affect the others.

To build high availability (HA), distribute resources across multiple zones in a region so that the loss of a single zone does not take down the workload:

  • Regional Managed Instance Groups (MIGs) spread Compute Engine VMs across zones and recreate failed instances automatically.
  • Regional Persistent Disks synchronously replicate block storage across two zones in a region.
  • Cloud SQL with HA uses a regional configuration with a standby in another zone and automatic failover.
  • GKE regional clusters place the control plane and nodes across three zones.

A regional internal/external load balancer in front of a regional MIG removes the single point of failure: health checks route traffic away from unhealthy instances and zones automatically. The key principle is to avoid zonal single points of failure for anything that needs to stay up.

# Create a regional MIG spread across all zones in the region
gcloud compute instance-groups managed create web-mig \
  --template=web-template \
  --size=3 \
  --region=europe-north1 \
  --target-distribution-shape=EVEN
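
As a rough illustration of why spreading instances across zones helps, the sketch below estimates the chance that at least one zone is serving. It assumes zones fail independently, which is a simplification, and the 99.9% per-zone figure is a made-up input rather than a published SLA.

```python
def combined_availability(zone_availability: float, zones: int) -> float:
    """Probability that at least one of `zones` independent zones is up."""
    if not 0.0 <= zone_availability <= 1.0:
        raise ValueError("zone_availability must be between 0 and 1")
    if zones < 1:
        raise ValueError("zones must be at least 1")
    # The workload is down only if every zone is down at the same time.
    return 1.0 - (1.0 - zone_availability) ** zones


# Hypothetical per-zone availability of 99.9%, for illustration only.
for zone_count in (1, 2, 3):
    print(zone_count, f"{combined_availability(0.999, zone_count):.9f}")
```

Real zones share some failure modes (a bad rollout, for example), so treat the result as an upper bound rather than a prediction.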

3 Multi-region architecture and disaster recovery (RTO/RPO)

Regional HA protects against zone failure; disaster recovery (DR) protects against the loss of an entire region. Two metrics drive every DR design:

  • RTO (Recovery Time Objective) — the maximum acceptable time to restore service after an incident.
  • RPO (Recovery Point Objective) — the maximum acceptable amount of data loss, measured in time (how far back the last good copy can be).

Common DR patterns trade cost against RTO/RPO:

  • Backup and restore — cheapest, highest RTO/RPO; restore from backups in a new region.
  • Cold standby — infrastructure defined as code, spun up on disaster.
  • Warm standby — a scaled-down copy runs continuously and is scaled up on failover.
  • Hot standby / active-active — full capacity runs in two regions, giving near-zero RTO/RPO at the highest cost.

Multi-region services help: Cloud Storage multi-region buckets, Spanner (multi-region configurations with strong consistency), and multi-region BigQuery datasets reduce the data-replication burden. Always test failover regularly — an untested DR plan is an assumption, not a capability.
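
One way to make RTO and RPO concrete is to check a candidate plan against them. The sketch below assumes the worst-case data loss equals one full interval between recoverable copies and the worst-case downtime equals the measured restore time. The plan names and durations are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DrPlan:
    name: str
    copy_interval: timedelta  # time between recoverable copies of the data
    restore_time: timedelta   # measured time to bring the service back


def meets_objectives(plan: DrPlan, rto: timedelta, rpo: timedelta) -> bool:
    """True if worst-case data loss fits the RPO and restore time fits the RTO."""
    return plan.copy_interval <= rpo and plan.restore_time <= rto


# Hypothetical plans and targets, for illustration only.
plans = [
    DrPlan("backup-and-restore", timedelta(hours=24), timedelta(hours=8)),
    DrPlan("warm-standby", timedelta(minutes=5), timedelta(minutes=30)),
]
target_rto = timedelta(hours=1)
target_rpo = timedelta(minutes=15)

for plan in plans:
    print(plan.name, meets_objectives(plan, target_rto, target_rpo))
```

The restore time should come from an actual failover test, not an estimate.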

4 Security best practices and defense in depth

Defense in depth layers multiple independent controls so that the failure of any single control does not compromise the system. On Google Cloud the layers include identity, network, application and data controls.

Foundational best practices:

  • Least privilege — grant the minimum IAM roles needed; prefer predefined or custom roles over broad basic roles like Owner/Editor.
  • Resource hierarchy — organise resources into Organization → Folders → Projects and apply policy at the right level.
  • Service accounts — use them for workloads, avoid downloaded keys, and prefer short-lived credentials.
  • Encryption by default — data is encrypted at rest and in transit automatically; add CMEK for control.
  • Security Command Center — centralised visibility into misconfigurations, vulnerabilities and threats.

The BeyondCorp zero-trust model is central to Google’s approach: access decisions are based on identity and device context rather than network location, so being inside the corporate network grants no implicit trust. Combine these layers so that a single misconfiguration is contained, not catastrophic.

5 Cloud KMS and customer-managed encryption keys (CMEK)

By default Google Cloud encrypts all data at rest with Google-managed keys — you do nothing and it is transparent. When you need control over the key lifecycle, use Cloud Key Management Service (Cloud KMS).

Key concepts:

  • CMEK (Customer-Managed Encryption Keys) — you create and manage keys in Cloud KMS, and services such as Cloud Storage, BigQuery, Persistent Disk and Cloud SQL use them to encrypt your data. You control rotation, disabling and destruction.
  • CSEK (Customer-Supplied Encryption Keys) — you supply the raw key material with each request; Google never stores it.
  • Key ring and key hierarchy — keys live in key rings scoped to a location; a key has versions, and rotation creates a new version.
  • Cloud HSM — keys backed by FIPS 140-2 Level 3 hardware security modules.
  • Cloud External Key Manager (EKM) — keys managed outside Google Cloud in a third-party key manager.

With CMEK, disabling or destroying a key effectively renders the protected data unreadable. This is a powerful but dangerous control, so guard key-admin permissions (for example the roles/cloudkms.admin role) carefully and keep them separate from data access (separation of duties).

# Create a key ring and a rotating symmetric key for CMEK
gcloud kms keyrings create app-keyring --location=europe-north1

gcloud kms keys create data-key \
  --location=europe-north1 \
  --keyring=app-keyring \
  --purpose=encryption \
  --rotation-period=90d \
  --next-rotation-time=2026-09-01T00:00:00Z
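
The relationship between a key, its versions and rotation can be pictured with a toy model. This is not the Cloud KMS API and performs no cryptography. It only tracks which version is primary and which versions are still usable, and the class and method names are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyKey:
    """Toy stand-in for a key with versions; no real encryption happens."""
    versions: dict = field(default_factory=lambda: {1: "ENABLED"})
    primary: int = 1

    def rotate(self) -> int:
        """Rotation adds a new version and makes it the primary for new writes."""
        new_version = max(self.versions) + 1
        self.versions[new_version] = "ENABLED"
        self.primary = new_version
        return new_version

    def disable(self, version: int) -> None:
        self.versions[version] = "DISABLED"

    def can_decrypt(self, version: int) -> bool:
        """Data stays readable only while the version that protected it is enabled."""
        return self.versions.get(version) == "ENABLED"


key = ToyKey()
old_version = key.primary
key.rotate()
print(key.primary, key.can_decrypt(old_version))  # old data is still readable
key.disable(old_version)
print(key.can_decrypt(old_version))               # now it is not
```

The model shows why rotation alone does not re-encrypt existing data, and why disabling an old version affects whatever that version still protects.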

6 Network security: Cloud Armor and Cloud Firewall

Protecting workloads at the network edge and inside the VPC uses two main services.

Google Cloud Armor is a web application firewall (WAF) and DDoS protection service that attaches to global external load balancers:

  • Define security policies with allow/deny rules based on IP, geography or expressions.
  • Use preconfigured WAF rules (based on the OWASP ModSecurity Core Rule Set) to block SQL injection, XSS and other attacks.
  • Adaptive Protection uses machine learning to detect and mitigate Layer 7 DDoS attacks.
  • Rate limiting throttles or bans clients exceeding a request threshold.

Cloud Firewall controls traffic to and from VM instances within the VPC:

  • VPC firewall rules and hierarchical firewall policies (at org/folder level) apply allow/deny by direction, IP range, protocol/port and target tags or service accounts.
  • Every VPC network has two implied rules, deny all ingress and allow all egress; you add explicit, higher-priority rules for your own traffic.

Cloud Armor protects Layer 7 at the edge; Cloud Firewall enforces Layer 3/4 segmentation inside the network — use them together.
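
To show how Layer 3/4 rule evaluation works in principle, here is a simplified model of ingress matching: rules are checked in priority order (lower number first), a deny beats an allow at the same priority, and unmatched traffic is denied. It ignores tags, service accounts, protocols and hierarchical policies, and the rule values are hypothetical.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressRule:
    priority: int      # lower number is evaluated first
    action: str        # "allow" or "deny"
    source_range: str  # CIDR block
    port: int


def evaluate_ingress(rules, source_ip: str, port: int) -> str:
    """Return the action of the first matching rule, or the implied deny."""
    ip = ipaddress.ip_address(source_ip)
    # Sort by priority; at equal priority, put deny rules ahead of allow rules.
    ordered = sorted(rules, key=lambda r: (r.priority, r.action != "deny"))
    for rule in ordered:
        if rule.port == port and ip in ipaddress.ip_network(rule.source_range):
            return rule.action
    return "deny"  # nothing matched: ingress is denied


# Hypothetical rules using documentation-only address ranges.
rules = [
    IngressRule(1000, "allow", "0.0.0.0/0", 443),
    IngressRule(900, "deny", "203.0.113.0/24", 443),
]
print(evaluate_ingress(rules, "203.0.113.7", 443))
print(evaluate_ingress(rules, "198.51.100.1", 443))
print(evaluate_ingress(rules, "198.51.100.1", 22))
```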

7 VPC Service Controls and Private Service Connect

Two services address data exfiltration and private access to Google APIs.

VPC Service Controls create a service perimeter around Google-managed services (such as Cloud Storage, BigQuery and Bigtable) to mitigate data exfiltration:

  • Resources inside the perimeter cannot be accessed from outside it, even by an identity with valid IAM permissions.
  • Access levels (via Access Context Manager) allow conditional access based on IP, device or identity.
  • Ingress/egress rules and perimeter bridges permit controlled cross-perimeter access.

Private Service Connect (PSC) lets you consume services privately using internal IP addresses inside your VPC, without traversing the public internet:

  • Reach Google APIs (a PSC endpoint for storage.googleapis.com, for example) over a private IP.
  • Connect to a producer service (your own or a partner’s) published behind a service attachment.

VPC Service Controls is about blocking exfiltration; Private Service Connect is about private connectivity. They are complementary, not alternatives.

8 Hybrid connectivity: Cloud Interconnect and Cloud VPN

Connecting an on-premises network or another cloud to Google Cloud uses two main families of products.

Cloud VPN establishes IPsec tunnels over the public internet:

  • HA VPN provides a 99.99% SLA using two tunnels on separate interfaces; Classic VPN offers a 99.9% SLA.
  • Uses Cloud Router for dynamic (BGP) route exchange.
  • Throughput is limited per tunnel and traffic is encrypted but crosses the internet.

Cloud Interconnect provides dedicated, private, high-bandwidth links:

  • Dedicated Interconnect — a direct physical connection (10 or 100 Gbps) into Google’s network at a colocation facility.
  • Partner Interconnect — connectivity through a supported service provider, useful when you cannot reach a colocation facility or need lower bandwidth.
  • Traffic does not traverse the public internet, offering lower latency and higher, more consistent bandwidth.

Choose VPN for quick, lower-bandwidth, internet-based connectivity, and Interconnect when you need high, predictable bandwidth and private transport.

9 Network Connectivity Center

Network Connectivity Center (NCC) is a hub-and-spoke model for managing connectivity across your Google Cloud and on-premises networks from a single place.

  • A hub is a central management resource.
  • Spokes attach to the hub and represent connectivity resources: HA VPN tunnels, Cloud Interconnect (VLAN) attachments, router appliances, and VPC spokes.
  • Spokes can exchange routes through the hub, enabling site-to-site data transfer — for example connecting two branch offices via Google’s backbone.

NCC simplifies large topologies: instead of building and maintaining a mesh of point-to-point connections and route exchanges, you connect each network to the hub. It is particularly valuable for organisations with many branch sites, multiple VPCs, or a need to use Google’s global backbone for inter-site transit. VPC spokes allow many VPCs to communicate without a full mesh of VPC peerings.
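
The scaling argument is simple arithmetic: a full mesh of n networks needs n(n-1)/2 connections, while a hub needs one attachment per network. A small sketch, with hypothetical function names:

```python
def full_mesh_links(networks: int) -> int:
    """Point-to-point connections needed so every network reaches every other."""
    return networks * (networks - 1) // 2


def hub_and_spoke_links(networks: int) -> int:
    """One spoke attachment per network."""
    return networks


for n in (3, 10, 50):
    print(n, full_mesh_links(n), hub_and_spoke_links(n))
```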

10 BigQuery: the serverless data warehouse

BigQuery is Google’s fully managed, serverless, petabyte-scale data warehouse. It separates storage from compute, so each scales independently and you pay for them separately.

  • Columnar storage and a distributed query engine (Dremel-based) make analytical SQL over huge datasets fast.
  • On-demand pricing charges by bytes scanned; capacity (editions/slots) pricing reserves dedicated compute.
  • Partitioning (by a date/timestamp column, ingestion time or integer range) and clustering reduce bytes scanned and therefore cost and latency.
  • BigQuery BI Engine caches data in memory for sub-second dashboards.
  • BigQuery ML trains and serves machine-learning models using SQL.
  • BigQuery Omni queries data in other clouds; external tables query data in Cloud Storage without loading it.

To control cost on on-demand pricing, design tables to scan less: select only the columns you need (avoid SELECT *), filter on partition columns, and cluster on frequently filtered fields. Bytes scanned, not rows returned, drives the bill.

-- Query that prunes by partition and selects only needed columns
SELECT user_id, SUM(amount) AS total
FROM `proj.sales.orders`
WHERE order_date BETWEEN '2026-06-01' AND '2026-06-26'
GROUP BY user_id;
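
To see how bytes scanned translate into spend, the sketch below compares a full scan with a partition-pruned scan. The price per TiB and both byte counts are placeholder assumptions, so check the current price list for your region before using real numbers.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def on_demand_cost(bytes_scanned: int, price_per_tib: float) -> float:
    """Estimated on-demand query cost for a given number of bytes scanned."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / TIB * price_per_tib


ASSUMED_PRICE_PER_TIB = 6.25       # placeholder, not an authoritative price
full_scan_bytes = 4 * TIB          # hypothetical: SELECT * over the whole table
pruned_scan_bytes = TIB // 16      # hypothetical: two columns, one month of partitions

print(on_demand_cost(full_scan_bytes, ASSUMED_PRICE_PER_TIB))
print(on_demand_cost(pruned_scan_bytes, ASSUMED_PRICE_PER_TIB))
```

A dry run of the actual query reports the bytes it would scan, which is the number to feed into an estimate like this.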

11 Data processing pipelines with Dataflow

Dataflow is Google’s fully managed, serverless service for executing Apache Beam pipelines. A single Beam pipeline can run in both batch and streaming modes — the same code, different bounded/unbounded sources.

  • Unified model — Beam’s programming model handles bounded (batch) and unbounded (streaming) data with the same transforms (ParDo, GroupByKey, windowing).
  • Windowing and triggers — group streaming data into fixed, sliding or session windows, with watermarks and triggers to handle late data.
  • Autoscaling and dynamic work rebalancing — Dataflow adjusts worker count and redistributes work to eliminate stragglers.
  • Templates — Google-provided and custom templates let you launch pipelines without compiling code each time.

A classic streaming pattern is Pub/Sub → Dataflow → BigQuery: events land in Pub/Sub, Dataflow transforms and enriches them in flight, and results stream into BigQuery for analytics. Dataflow’s exactly-once processing semantics make it reliable for financial and event data where duplicates matter.
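
Fixed windowing can be sketched in plain Python, without Beam: each event is assigned to the window containing its event time, and values are aggregated per window. This leaves out watermarks, triggers and late data, which Dataflow handles for you. The function name and sample events are illustrative.

```python
from collections import defaultdict


def fixed_window_sums(events, window_seconds: int) -> dict:
    """Sum values per fixed window.

    `events` is an iterable of (event_time_seconds, value) pairs; the result
    maps each window's start time to the sum of the values that fall in it.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    sums = defaultdict(int)
    for event_time, value in events:
        window_start = event_time - (event_time % window_seconds)
        sums[window_start] += value
    return dict(sorted(sums.items()))


# Events may arrive out of order; assignment depends only on event time.
sample_events = [(60, 5), (0, 1), (125, 1), (59, 2)]
print(fixed_window_sums(sample_events, 60))
```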

12 Event-driven architecture: Pub/Sub, Eventarc and Workflows

Event-driven systems decouple producers from consumers, improving scalability and resilience. Three services work together on Google Cloud.

Pub/Sub is a global, fully managed messaging service:

  • Topics receive messages; subscriptions deliver them (push or pull).
  • At-least-once delivery by default; exactly-once delivery is available for pull subscriptions within a region.
  • Dead-letter topics capture messages that repeatedly fail processing; message ordering can be enabled with ordering keys.

Eventarc routes events from Google Cloud sources (and third parties) to destinations such as Cloud Run, GKE and Workflows, using a standard CloudEvents format. It can trigger on Cloud Audit Logs events, letting you react to almost any Google Cloud action.

Workflows is a serverless orchestrator that connects services in a defined sequence with conditionals, retries and error handling, expressed in YAML/JSON. Use Workflows for orchestration (a known sequence of steps) and Pub/Sub for choreography (loosely coupled event reactions).

13 Caching with Memorystore for Redis

Memorystore provides fully managed, in-memory data stores. Memorystore for Redis, Memorystore for Valkey and Memorystore for Memcached remove the operational burden of running a cache cluster.

  • Sub-millisecond latency for cache hits, offloading read pressure from databases.
  • Tiers — the Basic tier is a single node (no replication); the Standard tier adds a replica in another zone with automatic failover for HA.
  • Read replicas scale read throughput; Cluster mode shards data across nodes for larger capacity and throughput.
  • Accessed over private IP within your VPC, never exposed to the public internet by default.

Common patterns: cache-aside (the app checks the cache, falls back to the database on a miss, then populates the cache), session storage, rate-limiting counters, and leaderboards. For HA you must use the Standard tier — the Basic tier has no failover, so a node failure loses the cache contents.
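
The cache-aside flow can be sketched as follows. A plain dictionary stands in for Memorystore and a callable stands in for the database, so the class and its names are illustrative rather than a Redis client API.

```python
class CacheAside:
    """Check the cache first, fall back to the database, then populate the cache."""

    def __init__(self, load_from_db):
        self._cache = {}                  # stand-in for a Memorystore instance
        self._load_from_db = load_from_db
        self.db_reads = 0                 # counts how often the database is hit

    def get(self, key):
        if key in self._cache:
            return self._cache[key]       # cache hit
        self.db_reads += 1                # cache miss: read from the database
        value = self._load_from_db(key)
        self._cache[key] = value
        return value

    def invalidate(self, key):
        """Call after the underlying record changes so stale data is not served."""
        self._cache.pop(key, None)


store = CacheAside(load_from_db=lambda key: f"profile-for-{key}")
store.get("user-1")
store.get("user-1")
print(store.db_reads)
```

A real cache would also set a TTL on each entry, so that a missed invalidation eventually corrects itself.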

14 API management with Apigee and API Gateway

Exposing services as managed APIs uses two products with different scopes.

Apigee is a full-lifecycle, enterprise API management platform:

  • API proxies decouple the consumer-facing API from backend implementation.
  • Policies add security (OAuth, API keys, JWT), traffic management (quotas, spike arrest), and transformation (mediation between JSON/XML).
  • Developer portal, monetisation and analytics support API products and partner ecosystems.

API Gateway is a lighter-weight, serverless option:

  • Fronts backends such as Cloud Functions, Cloud Run and App Engine.
  • Configured with an OpenAPI specification; handles authentication (API keys, service accounts, JWT) and basic routing.

Choose Apigee for rich enterprise needs — monetisation, advanced policies, developer ecosystems — and API Gateway for simple, cost-effective gateways in front of serverless backends.

15 Microservices on GKE: Autopilot and Anthos

Google Kubernetes Engine (GKE) is the managed Kubernetes platform for running containerised microservices.

GKE offers two modes of operation:

  • Standard — you manage and pay for nodes, with full control over node configuration.
  • Autopilot — Google manages nodes for you; you pay per Pod resource request, and security/best-practice defaults are enforced. It reduces operational overhead and is the recommended default for most workloads.

Key building blocks for microservices:

  • Horizontal Pod Autoscaler (HPA) and Cluster Autoscaler / node auto-provisioning scale Pods and nodes by demand.
  • Workload Identity binds Kubernetes service accounts to Google service accounts for secure, keyless access to Google APIs.
  • GKE Gateway / Ingress integrates with global load balancing.

Anthos / GKE Enterprise extends GKE to hybrid and multi-cloud, adding a managed service mesh (Istio-based), Config Management (policy as code via GitOps) and fleet management across clusters running anywhere.

16 Observability at scale with Cloud Operations

The Google Cloud Operations suite (formerly Stackdriver) provides integrated observability across metrics, logs and traces.

  • Cloud Monitoring — collects metrics, builds dashboards, and defines alerting policies. SLOs and error budgets can be tracked against service-level indicators.
  • Cloud Logging — centralises logs with powerful queries; log sinks route logs to Cloud Storage, BigQuery or Pub/Sub; log-based metrics derive metrics from log content.
  • Cloud Trace — distributed tracing shows latency across microservice calls, helping find bottlenecks.
  • Cloud Profiler — continuous, low-overhead CPU and memory profiling of production code.
  • Error Reporting — aggregates and surfaces application errors.

At scale, the goal is to observe the system through the lens of the four golden signals — latency, traffic, errors and saturation. Combine traces (where time is spent), metrics (trends and alerts) and logs (root-cause detail) rather than relying on any single signal.
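
An error budget is the share of a time window in which the service may miss its SLO. A minimal sketch, assuming a time-based availability SLO and a 30-day window:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for an availability SLO."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Minutes of budget left; a negative result means the SLO was missed."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


print(round(error_budget_minutes(0.999), 1))    # about 43.2 minutes
print(round(budget_remaining(0.999, 30.0), 1))  # about 13.2 minutes left
```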

17 Governance with Organization Policy and folders

Governance enforces guardrails across many projects so teams move fast without violating security or compliance.

The resource hierarchy (Organization → Folders → Projects → resources) is the backbone:

  • Folders group projects (for example by department or environment) and let you apply IAM and policy at a level above individual projects.
  • IAM and policies are inherited down the hierarchy.

Organization Policy Service sets constraints that restrict what configurations are allowed, regardless of IAM permissions:

  • Examples: restrict resource locations (data residency), disable service account key creation, require OS Login, restrict allowed external IPs, restrict which services can be enabled.
  • Constraints can be set at organization, folder or project level and inherited.

Whereas IAM answers who can do what, Organization Policy answers what is allowed at all. Together with hierarchical firewall policies and VPC Service Controls they form a layered governance model. Tools like the Cloud Foundation Toolkit codify these guardrails as infrastructure as code.

18 Cost optimisation: committed-use discounts, Spot VMs and rightsizing

Cost optimisation is a first-class architecture concern. Google Cloud offers several levers.

  • Committed Use Discounts (CUDs) — commit to a steady amount of usage for 1 or 3 years for a significant discount. Resource-based CUDs apply to specific vCPU/memory; spend-based CUDs apply to a dollar amount on certain services.
  • Sustained Use Discounts — automatic discounts for running certain VMs for a large portion of the month, applied with no commitment.
  • Spot VMs (and Preemptible VMs) — heavily discounted, but can be reclaimed by Google at any time; ideal for fault-tolerant, batch and stateless workloads.
  • Rightsizing recommendations — the Recommender analyses usage and suggests resizing or removing idle/over-provisioned resources.
  • Budgets and alerts, and billing export to BigQuery — monitor and analyse spend.

A practical strategy: cover a steady baseline with CUDs, run elastic and interruptible work on Spot VMs and autoscaling, and act on rightsizing recommendations regularly. Sustained use discounts apply automatically, so they require no action.
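
The baseline-plus-burst strategy can be turned into a rough estimate. Every rate and discount below is a hypothetical placeholder, not a published price, and the model ignores memory, sustained use discounts and Spot price variation.

```python
def monthly_compute_cost(
    baseline_vcpus: int,
    burst_vcpu_hours: float,
    on_demand_rate: float,   # price per vCPU-hour (assumed)
    cud_discount: float,     # fraction off for committed baseline (assumed)
    spot_discount: float,    # fraction off for Spot capacity (assumed)
    hours_in_month: int = 730,
) -> float:
    """Estimate cost when the baseline is committed and the burst runs on Spot."""
    baseline = baseline_vcpus * hours_in_month * on_demand_rate * (1 - cud_discount)
    burst = burst_vcpu_hours * on_demand_rate * (1 - spot_discount)
    return baseline + burst


# Hypothetical inputs, for illustration only.
optimised = monthly_compute_cost(10, 1000, 0.04, cud_discount=0.5, spot_discount=0.75)
all_on_demand = monthly_compute_cost(10, 1000, 0.04, cud_discount=0.0, spot_discount=0.0)
print(round(optimised, 2), round(all_on_demand, 2))
```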

19 Identity federation and Workload Identity Federation

Avoiding long-lived service account keys is a major security goal. Several federation mechanisms enable keyless, short-lived access.

  • Workforce Identity Federation lets human users from an external identity provider (Okta, Azure AD/Entra ID, any OIDC/SAML IdP) access Google Cloud without creating Google accounts.
  • Workload Identity Federation lets workloads running outside Google Cloud (on AWS, Azure, on-prem, or in CI/CD systems like GitHub Actions) impersonate a Google service account using their own native credentials — no exported keys needed.
  • Workload Identity (for GKE) binds Kubernetes service accounts to Google service accounts so Pods access Google APIs without keys.

The mechanism uses a workload identity pool and a provider that trusts the external IdP. The external credential (an OIDC token, for example) is exchanged via the Security Token Service for a short-lived Google access token. Because no long-lived, downloadable key exists, the main exfiltration risk of static credentials is eliminated. Prefer federation over service account keys wherever possible.
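
The token exchange step can be pictured by building the request without sending it. The parameter names below follow the OAuth 2.0 Token Exchange standard (RFC 8693). The project number, pool and provider IDs are hypothetical, and the exact fields the Security Token Service accepts should be confirmed against its reference before relying on this sketch.

```python
from urllib.parse import urlencode

STS_TOKEN_URL = "https://sts.googleapis.com/v1/token"


def build_token_exchange(project_number: str, pool_id: str,
                         provider_id: str, oidc_token: str) -> dict:
    """Build (but do not send) the parameters for an STS token exchange."""
    audience = (
        f"//iam.googleapis.com/projects/{project_number}/locations/global/"
        f"workloadIdentityPools/{pool_id}/providers/{provider_id}"
    )
    return {
        "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
        "audience": audience,
        "scope": "https://www.googleapis.com/auth/cloud-platform",
        "requested_token_type": "urn:ietf:params:oauth:token-type:access_token",
        "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
        "subject_token": oidc_token,
    }


# Hypothetical identifiers; the token is a placeholder, not a real credential.
params = build_token_exchange("123456789", "my-pool", "my-provider", "EXTERNAL_OIDC_TOKEN")
request_body = urlencode(params)  # form-encoded body that would be POSTed to STS_TOKEN_URL
print(params["audience"])
```

In practice the Google auth libraries perform this exchange from a credential configuration file, so application code rarely builds the request by hand.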

20 Autoscaling architectures

Autoscaling matches capacity to demand, improving both reliability and cost. Google Cloud autoscales at several layers.

  • Managed Instance Groups (MIGs) autoscale Compute Engine VMs based on CPU utilisation, load-balancing capacity, Cloud Monitoring metrics, or schedules. Predictive autoscaling can add capacity ahead of forecast demand.
  • GKE uses the Horizontal Pod Autoscaler (more Pods), the Vertical Pod Autoscaler (right-size Pod requests), and the Cluster Autoscaler / node auto-provisioning (more nodes).
  • Cloud Run scales container instances automatically from zero to many based on concurrent requests, with configurable min/max instances.
  • Serverless services (Cloud Functions, App Engine) scale on demand inherently.

Design considerations: keep instances stateless so any instance can serve any request; externalise state to databases or caches; set sensible min instances to avoid cold-start latency; and cap max instances to protect downstream dependencies from being overwhelmed during a spike.
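
The core of the Kubernetes Horizontal Pod Autoscaler is a ratio: desired replicas = ceil(current replicas × current metric / target metric), clamped to the configured bounds. The sketch below implements only that ratio and omits the tolerance band and stabilisation windows that the real controller applies.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Scale replicas in proportion to how far the metric is from its target."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # The max bound protects downstream dependencies; the min bound avoids cold starts.
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(4, current_metric=90, target_metric=60))   # scale out
print(desired_replicas(4, current_metric=30, target_metric=60))   # scale in
print(desired_replicas(4, current_metric=300, target_metric=60))  # capped at max
```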

21 Serverless patterns: Cloud Run with Pub/Sub

Cloud Run runs stateless containers that scale automatically, including down to zero. Combined with Pub/Sub it enables robust, decoupled, event-driven processing.

The canonical pattern:

  • A producer publishes events to a Pub/Sub topic.
  • A push subscription delivers each message as an HTTP POST to a Cloud Run service, which scales up with the message rate.
  • Cloud Run processes the message and returns a success status; on failure the message is retried, and after repeated failures it goes to a dead-letter topic.

Important design points:

  • Make handlers idempotent — Pub/Sub guarantees at-least-once delivery, so duplicates can occur.
  • Use a dedicated service account with least privilege for the push subscription.
  • Tune concurrency (requests per instance) and acknowledgement deadlines for long-running work.
  • Cloud Run also supports Eventarc triggers for a broader range of event sources.

This pattern absorbs spikes gracefully: Pub/Sub buffers the backlog while Cloud Run scales out, then both settle as the load subsides.

# Create a push subscription that delivers to a Cloud Run service
gcloud pubsub subscriptions create orders-sub \
  --topic=orders \
  --push-endpoint=$(gcloud run services describe order-worker --region=europe-north1 --format='value(status.url)') \
  --push-auth-service-account=run-invoker@my-proj.iam.gserviceaccount.com \
  --dead-letter-topic=orders-dead-letter \
  --max-delivery-attempts=5
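
Because duplicates can occur, the receiving service should be safe to call twice with the same message. The sketch below shows one way to do that, framework-free: it de-duplicates on the messageId field of the push envelope. The in-memory set stands in for a durable store such as a database table, and the function and variable names are hypothetical.

```python
import base64
import json

processed_ids = set()  # stand-in for a durable de-duplication store
orders = []            # stand-in for the real side effect (a database write, say)


def handle_push(envelope: dict) -> int:
    """Process one Pub/Sub push envelope and return an HTTP status code."""
    message = envelope.get("message")
    if not message or "messageId" not in message:
        return 400  # non-success status: the message is not acknowledged
    message_id = message["messageId"]
    if message_id in processed_ids:
        return 204  # duplicate: acknowledge without repeating the work
    try:
        payload = json.loads(base64.b64decode(message.get("data", "")))
    except ValueError:
        return 400  # undecodable payload; repeated failures reach the dead-letter topic
    orders.append(payload)
    processed_ids.add(message_id)
    return 204      # success status acknowledges the message


# Simulate the same message being delivered twice.
data = base64.b64encode(json.dumps({"order_id": 42}).encode()).decode()
envelope = {"message": {"messageId": "m-1", "data": data}, "subscription": "orders-sub"}
print(handle_push(envelope), handle_push(envelope), len(orders))
```

In a real service the duplicate check and the side effect should happen atomically, for example in one database transaction, so a crash between them cannot cause a repeat.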

22 Global load balancing and Cloud CDN

Serving a global audience with low latency and high availability relies on Google’s edge network.

The global external Application Load Balancer provides a single anycast IP address that routes users to the nearest healthy backend:

  • Traffic enters Google’s network at the nearest edge point of presence and travels over Google’s private backbone to the backend region.
  • Backend services can span multiple regions; the load balancer directs requests to the closest region with capacity and fails over automatically.
  • It integrates with Cloud Armor (WAF/DDoS), SSL/TLS termination with managed certificates, and URL maps for path-based routing.

Cloud CDN caches content at Google’s edge locations:

  • Enabled on a backend of the load balancer; cacheable responses are served from the edge close to users, reducing latency and backend load.
  • Cache behaviour is controlled by cache modes and HTTP Cache-Control headers; cache invalidation purges stale content.

Together, a global load balancer plus Cloud CDN deliver static and dynamic content fast worldwide, while Cloud Armor protects the same entry point.
