📈

Observability & Monitoring (Intermediate)

Understand running systems: metrics, logs and traces, alerting, dashboards, SLOs and error budgets.

19 lessons 57 quiz questions

📚 Lessons & quizzes

Each lesson ends with its own short quiz. Answer them as you go — score 90% across all lessons to earn your certificate.

1 Monitoring vs Observability

Monitoring and observability are related but distinct. Monitoring watches a system against a predefined set of checks and dashboards: you decide in advance what to measure (CPU, request rate, error count) and you get alerted when those known quantities cross a threshold. It answers questions you already knew to ask.

Observability is a property of a system: how well you can understand its internal state from its external outputs (its telemetry). A highly observable system lets you ask new questions you never anticipated, without shipping new code.

The classic framing is known-unknowns versus unknown-unknowns. Monitoring excels at known-unknowns: "I know disk could fill up, so I watch free space." Observability targets unknown-unknowns: the novel, surprising failure modes of complex distributed systems that you could not have predicted in advance.

They are not opposites. Monitoring is largely a subset of what an observable system enables: you still build dashboards and alerts, but on top of rich, high-cardinality telemetry that supports open-ended exploration.

2 The Three Pillars: Metrics, Logs and Traces

Telemetry is conventionally grouped into three pillars, each answering different questions:

  • Metrics — numeric measurements aggregated over time (request rate, latency percentiles, memory used). They are cheap to store and great for dashboards and alerts, but they aggregate away detail.
  • Logs — discrete, timestamped records of events ("user 42 placed order 9001"). Rich in detail and context, but high volume and harder to aggregate.
  • Traces — the end-to-end path of a single request as it flows through many services, showing where time was spent. Essential for understanding latency in distributed systems.

No single pillar is sufficient. Metrics tell you that something is wrong and roughly where; traces tell you which hop is slow; logs tell you why a specific operation failed. Modern practice ties them together (for example, a trace ID embedded in logs) so you can pivot between them.

3 Metric Types: Counter, Gauge, Histogram, Summary

Metrics come in a few standard types, and choosing the right one matters:

  • Counter — a value that only ever increases (or resets to zero on restart), such as total requests served or errors seen. You typically look at its rate of change, not its raw value.
  • Gauge — a value that can go up or down, such as current memory in use, queue depth, or temperature. It is a snapshot of "right now".
  • Histogram — samples observations (e.g. request durations) into configurable buckets, letting you compute quantiles like the 95th-percentile latency on the server side via aggregation.
  • Summary — similar to a histogram but calculates configurable quantiles on the client side at observation time; its quantiles cannot be meaningfully aggregated across instances.

Rule of thumb: use a counter for things you count, a gauge for things you measure that fluctuate, and a histogram for distributions you want to slice by percentile across many instances.

# Prometheus-style metric exposition
# Counter: only goes up
http_requests_total{method="GET",code="200"} 13847

# Gauge: up and down
queue_depth{queue="emails"} 17

# Histogram: bucketed observations
http_request_duration_seconds_bucket{le="0.1"} 9001
http_request_duration_seconds_bucket{le="0.5"} 12044
http_request_duration_seconds_bucket{le="+Inf"} 13847
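Because a counter can reset to zero on restart, its rate of change has to treat a drop as a reset rather than as a negative increase. The following is a minimal Python sketch of that idea. The function names and the sample data are hypothetical, and Prometheus's own rate() also extrapolates to the edges of the window, which this sketch leaves out.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def counter_increase(samples: List[Sample]) -> float:
    """Total increase across the samples, treating any drop as a reset to zero."""
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            # The counter went backwards, so the process restarted at 0 and
            # everything counted since the restart is new.
            increase += cur
    return increase


def per_second_rate(samples: List[Sample]) -> float:
    """Average per-second increase between the first and last sample."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed if elapsed > 0 else 0.0


# Hypothetical scrapes every 15s; the process restarts before the fourth scrape.
scrapes = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 20.0), (60, 50.0)]
print(counter_increase(scrapes))  # 30 + 30 + 20 + 30 = 110.0
print(per_second_rate(scrapes))   # about 1.83 per second
```

Graphing the raw values would show a cliff at the restart; the reset-aware increase stays smooth, which is why dashboards plot the rate rather than the counter itself.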

4 Time-Series Databases and Prometheus

Metrics are stored in a time-series database (TSDB): a store optimised for streams of (timestamp, value) points, each identified by a metric name plus a set of labels. Prometheus is the de-facto open-source standard.

Prometheus uses a pull model: the server periodically scrapes an HTTP endpoint (commonly /metrics) on each target, rather than waiting for targets to push data to it. Pulling makes the server the source of truth for what is up: a target that fails to be scraped is visibly down, and you can scrape any healthy instance from your laptop for debugging.

Targets are discovered through static config or service discovery (Kubernetes, Consul, cloud APIs). Each scrape stores the current value of every exposed series, together with its labels, as a timestamped sample. For short-lived batch jobs that may exit before they can be scraped, Prometheus offers the Pushgateway as a deliberate exception: the job pushes its metrics there, and Prometheus scrapes the gateway.

# A minimal prometheus.yml scrape config
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['10.0.0.5:8080', '10.0.0.6:8080']

# Prometheus then GETs http://10.0.0.5:8080/metrics every 15s
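To make the pull model concrete, here is a rough sketch of what a target's side of the scrape could look like, using only the Python standard library. The METRICS dictionary, the render_metrics and serve helpers, and port 8080 are assumptions for illustration; a real service would normally use a Prometheus client library, which also emits # HELP and # TYPE lines.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process values the application would update as it runs.
METRICS = {
    'http_requests_total{method="GET",code="200"}': 13847,
    'queue_depth{queue="emails"}': 17,
}


def render_metrics(metrics: dict) -> str:
    """Render one 'name{labels} value' line per series."""
    return "".join(f"{series} {value}\n" for series, value in metrics.items())


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Block forever, answering scrapes on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() and then fetching /metrics with curl or a browser returns the same text Prometheus would scrape, which is what makes debugging a single instance by hand so easy.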

5 PromQL at a Glance

PromQL is Prometheus’ query language. A bare metric name returns an instant vector: the latest sample of every matching series. You filter by labels in braces and select a range with a duration in square brackets.

Because counters only increase, you almost never graph them directly; instead you wrap them in rate(), which gives the per-second average increase over a range. histogram_quantile() turns histogram buckets into a latency percentile. Aggregation operators like sum and avg collapse series across labels, and a by (...) clause names the labels to keep.

The first query below computes the per-second request rate over the last 5 minutes, summed per HTTP status code, a typical building block for error-ratio alerts.

# Per-second request rate over 5m, grouped by status code
sum(rate(http_requests_total[5m])) by (code)

# 95th-percentile latency from a histogram
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error ratio
sum(rate(http_requests_total{code=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
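It helps to see how a percentile can be recovered from cumulative buckets at all. The sketch below is a simplified Python rendition of the idea behind histogram_quantile(): find the bucket containing the target rank and interpolate linearly inside it. It works on raw bucket counts (PromQL would apply it to bucket rates), reuses the bucket values from lesson 3, and is not Prometheus's actual implementation.

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, float]  # (upper bound "le", cumulative count)


def histogram_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile from cumulative buckets sorted by upper bound."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]  # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total  # position of the target observation
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The target lies beyond the last finite bucket, so the best
                # available answer is that bucket's upper bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


buckets = [(0.1, 9001), (0.5, 12044), (math.inf, 13847)]
print(histogram_quantile(0.50, buckets))  # about 0.077 seconds
print(histogram_quantile(0.95, buckets))  # 0.5: the p95 falls in the +Inf bucket
```

The second result shows why bucket boundaries matter: with no finite bucket above 0.5s, the estimate cannot say anything more precise than "at least 0.5s".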

6 The Four Golden Signals

Google’s SRE practice distils service health into four golden signals. If you can only measure four things about a user-facing system, measure these:

  • Latency — how long requests take. Crucially, track the latency of successful and failed requests separately; a fast error is still bad.
  • Traffic — how much demand the system is under, e.g. requests per second.
  • Errors — the rate of requests that fail, whether explicitly (HTTP 500) or implicitly (wrong content, too slow).
  • Saturation — how "full" the most constrained resource is (CPU, memory, I/O, connection pool). Systems degrade as they approach saturation.

Watching these four gives broad coverage of user-perceived health with very few metrics, and they map cleanly onto alerts and dashboards.

7 The RED and USE Methods

Two complementary "methods" tell you which metrics to collect.

The RED method (Tom Wilkie) targets request-driven services. For every service, track:

  • Rate — requests per second.
  • Errors — failed requests per second.
  • Duration — distribution of request latencies.

The USE method (Brendan Gregg) targets resources (CPUs, disks, network interfaces, queues). For every resource, track:

  • Utilization — percent of time the resource was busy.
  • Saturation — how much extra work is queued and waiting.
  • Errors — count of error events.

RED is service/request-centric; USE is resource-centric. Together they cover both sides: how requests are served, and whether the underlying machinery is overwhelmed.
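As an illustration of the RED method, the Python sketch below derives all three numbers from a window of request records. The Request type, the red_summary function, the choice of status 500 and above as "failed", and the sample data are all assumptions; in production these values would come from counters and histograms rather than a list in memory.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    status: int        # HTTP status code
    duration_s: float  # time taken to serve the request, in seconds


def red_summary(requests: List[Request], window_s: float) -> Dict[str, float]:
    """Summarise Rate, Errors and Duration for one window of requests."""
    n = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_s for r in requests)
    if n:
        rank = (95 * n + 99) // 100  # nearest-rank p95: ceil(0.95 * n)
        p95 = durations[rank - 1]
    else:
        p95 = 0.0
    return {
        "rate_rps": n / window_s,
        "errors_rps": errors / window_s,
        "p95_s": p95,
    }


# Hypothetical 10-second window: eight successes and two server errors.
statuses = [200] * 8 + [500] * 2
times = [0.05, 0.07, 0.08, 0.10, 0.12, 0.15, 0.20, 0.30, 0.45, 2.00]
window = [Request(s, d) for s, d in zip(statuses, times)]
print(red_summary(window, window_s=10.0))
```

Note how the p95 (2.0s) exposes the slow outlier that an average of the same durations would largely hide.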

8 Logging Best Practices: Structure, Levels, Correlation

Good logs are structured: emitted as machine-parseable key/value records (typically JSON) rather than free-form prose. Structured logs can be filtered and aggregated without brittle regular expressions — you can query level=error AND user_id=42 directly.

Use log levels consistently. A common hierarchy, from most to least severe, is ERROR, WARN, INFO, DEBUG, TRACE. Production usually runs at INFO and can be turned up to DEBUG temporarily. Avoid logging at the wrong level: routine successes at ERROR cause alert fatigue, while genuine failures hidden at DEBUG get missed.

Attach a correlation ID (also called a request or trace ID) to every log line for a given request. As the request fans out across services, the shared ID lets you reconstruct the full story by filtering on one value — turning scattered lines into a coherent timeline.

// A structured (JSON) log line
{
  "ts": "2026-06-24T10:15:03Z",
  "level": "error",
  "msg": "payment declined",
  "correlation_id": "req-7f3a9",
  "user_id": 42,
  "amount": 19.99
}
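One way to produce a line like the one above is a small JSON formatter on top of Python's standard logging module. This is a sketch under assumptions: the "payments" logger name, the JsonFormatter class and the convention of passing structured fields under a "fields" key are illustrative choices, not a standard.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        when = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "ts": when.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # Anything passed via extra={"fields": {...}} lands on the record.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def make_logger(stream) -> logging.Logger:
    logger = logging.getLogger("payments")  # hypothetical logger name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)  # production default; raise to DEBUG when needed
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = make_logger(buffer)
log.error(
    "payment declined",
    extra={"fields": {"correlation_id": "req-7f3a9", "user_id": 42, "amount": 19.99}},
)
log.debug("card fingerprint computed")  # below INFO, so nothing is emitted
print(buffer.getvalue(), end="")
```

Because every line is a JSON object carrying correlation_id, a log store can filter on that one field to rebuild the full timeline of a request.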

9 Centralised Logging: ELK and Loki

On a single server you might tail a file, but across dozens of ephemeral containers you need centralised logging: a shipper collects logs from every host and forwards them to a central store you can search in one place.

The classic ELK stack (now the Elastic Stack) combines three tools: Logstash (or Beats) collects and transforms logs, Elasticsearch indexes the full text for fast search, and Kibana provides the search-and-visualise UI. Full-text indexing is powerful but storage- and resource-hungry.

Grafana Loki takes a lighter approach inspired by Prometheus: it indexes only a small set of labels (like app and level) and stores the raw log content compressed, rather than indexing every word. This makes it cheaper to run, at the cost of slower arbitrary full-text search. Both solve the same core problem: aggregate, store and query logs from many sources centrally.

10 Distributed Tracing: Spans and Trace Context

In a microservice architecture, one user request may touch a dozen services. Distributed tracing follows that request end to end.

A trace represents the whole journey of a single request and is composed of spans. Each span is one unit of work — a single service handling part of the request — with a name, a start time and duration, and key/value attributes. Spans form a tree: a parent span (e.g. the API gateway) has child spans (the services it called), so you see exactly where time went.

To stitch spans together across process boundaries, services propagate trace context: the trace ID and parent span ID are passed along, typically in HTTP headers. The W3C traceparent header is the standard format. When a service receives a request, it reads the incoming context, creates a child span, and passes updated context to anything it calls downstream.

# W3C trace context header carried between services
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
#            ^^ version
#               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ trace-id (whole request)
#                                                ^^^^^^^^^^^^^^^^ parent span-id
#                                                                 ^^ flags (sampled)
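The read-then-propagate step described above can be sketched in a few lines of Python. The names TraceContext, parse_traceparent and child_traceparent are hypothetical, and the sketch checks only the field layout; the W3C specification adds further rules (for example rejecting all-zero IDs) that a real tracing library would enforce.

```python
import re
import secrets
from typing import NamedTuple

_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    version: str
    trace_id: str   # shared by every span in the request
    parent_id: str  # span ID of the caller
    flags: str

    @property
    def sampled(self) -> bool:
        # The lowest bit of the flags byte marks the trace as sampled.
        return (int(self.flags, 16) & 0x01) == 1


def parse_traceparent(header: str) -> TraceContext:
    """Split an incoming traceparent header into its four fields."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        raise ValueError(f"malformed traceparent: {header!r}")
    return TraceContext(*match.groups())


def child_traceparent(incoming: TraceContext) -> str:
    """Header for a downstream call: same trace ID, this service's new span ID."""
    span_id = secrets.token_hex(8)  # 8 random bytes, 16 hex characters
    return f"{incoming.version}-{incoming.trace_id}-{span_id}-{incoming.flags}"


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx.trace_id, ctx.sampled)
print(child_traceparent(ctx))  # span ID differs on every run
```

The trace ID never changes along the call chain, while each hop substitutes its own span ID, which is what lets a backend reassemble the span tree.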

11 OpenTelemetry: A Vendor-Neutral Standard

OpenTelemetry (OTel) is the open, vendor-neutral standard for generating and shipping telemetry — metrics, logs and traces — across languages and backends. It emerged from the merger of two earlier projects, OpenTracing and OpenCensus, and is now a CNCF project.

OTel provides language SDKs to instrument your code, a wire protocol (OTLP) to transmit the data, and the Collector, a standalone service that receives, processes (batching, filtering, enriching) and exports telemetry to one or more backends.

The big win is decoupling: you instrument once with OTel, and you can send data to Prometheus, Jaeger, a commercial vendor, or several at once — switching backends without re-instrumenting. Instrumentation can be manual (you create spans in code) or automatic (agents that wrap common libraries for you).

12 Dashboards and Visualization with Grafana

A dashboard turns raw telemetry into visuals a human can absorb at a glance. Grafana is the most popular open-source dashboarding tool: it connects to many data sources (Prometheus, Loki, Elasticsearch, SQL databases) and renders panels — graphs, gauges, tables and heatmaps — driven by queries such as PromQL.

Good dashboards follow some principles: put the most important signals (often the golden signals) at the top; use consistent time ranges so panels can be compared; show rates and percentiles rather than raw totals; and avoid overcrowding — a wall of 50 graphs hides the one that matters.

Dashboards are for understanding and exploration; they are not a substitute for alerting. Nobody watches a screen 24/7, so important conditions must page someone automatically. Treat dashboards as the place you go after an alert fires to diagnose what is happening.

13 Alerting: Thresholds, Rate-of-Change and Fatigue

Alerting turns a metric condition into a notification. The simplest rule is a static threshold ("error ratio > 5%"). But thresholds are blunt: too tight and they fire constantly, too loose and they miss real problems. Rate-of-change alerts catch trends, such as a disk filling fast enough to run out within an hour, well before a static "90% full" line trips.

The biggest operational risk is alert fatigue: when alerts are too noisy, flaky, or non-actionable, on-call engineers start ignoring them — and miss the one that mattered. The cure is to alert on symptoms users feel (the service is slow or erroring) rather than every internal cause, and to make every alert actionable: if there is nothing a human should do, it should not page.

Alerts that page a human (waking them at 3am) must be reserved for genuine, urgent, user-affecting problems. Lower-urgency issues can go to a ticket or a dashboard instead of a page.

# A Prometheus alerting rule
groups:
  - name: availability
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels: {severity: page}
        annotations: {summary: "5xx error ratio above 5% for 10m"}
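The rule above is a static threshold. For the rate-of-change case mentioned earlier (a disk about to fill), one possible approach is to fit a straight line to recent free-space samples and extrapolate to zero. The Python sketch below does that with a least-squares fit; the function names, the one-hour horizon and the sample data are assumptions for illustration.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (seconds since first sample, free space in GiB)


def seconds_until_empty(samples: List[Sample]) -> Optional[float]:
    """Fit a least-squares line to free space and extrapolate to zero.

    Returns None when free space is flat or growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope >= 0:
        return None
    intercept = mean_v - slope * mean_t
    zero_at = -intercept / slope  # time at which the fitted line reaches 0
    return max(0.0, zero_at - samples[-1][0])


def should_alert(samples: List[Sample], horizon_s: float = 3600.0) -> bool:
    """Fire when the disk is predicted to fill within the horizon."""
    remaining = seconds_until_empty(samples)
    return remaining is not None and remaining <= horizon_s


# Hypothetical disk losing 1 GiB per minute, sampled once a minute.
fast = [(0, 40.0), (60, 39.0), (120, 38.0), (180, 37.0), (240, 36.0), (300, 35.0)]
print(seconds_until_empty(fast))  # roughly 2100 seconds, i.e. 35 minutes
print(should_alert(fast))
```

A static "90% full" rule would stay silent here if the disk is large, while the trend-based check fires with about half an hour of warning.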

14 On-Call and Paging

When an urgent alert fires, someone must respond. On-call is the rotation of engineers responsible for responding to production incidents at any hour. A paging tool (PagerDuty, Opsgenie, or similar) receives alerts and notifies the right person via push, SMS or phone call.

A few concepts make on-call humane and reliable. An escalation policy re-routes an alert to a backup or manager if the primary does not acknowledge within a few minutes, so a missed page never goes unanswered. Acknowledgement tells the system a human is now handling it. Sensible rotations spread the burden and prevent burnout.

A healthy on-call culture treats page volume itself as a signal: a noisy rotation is a bug to be fixed (better alerts, more reliability work), not a fact of life. Tracking alerts per shift keeps the load sustainable.

15 SLIs, SLOs, SLAs and Error Budgets

These three acronyms are often confused; keep them distinct.

  • An SLI (Service Level Indicator) is a measurement of some aspect of service quality — for example, the proportion of requests served successfully in under 300ms.
  • An SLO (Service Level Objective) is an internal target for an SLI — for example, "99.9% of requests succeed within 300ms over 30 days." It is a goal you set yourself.
  • An SLA (Service Level Agreement) is a contract with customers that includes consequences (refunds, credits) if the promised level is not met. SLAs are usually looser than the internal SLO so you have a safety margin.

The gap between perfect and your SLO is the error budget: with a 99.9% availability SLO you may be "down" for 0.1% of the time, which is about 43 minutes in a 30-day window. That budget is permission to take risk (ship features, do maintenance) as long as it is not exhausted. Burn through it and you freeze risky changes to focus on reliability.
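The error-budget arithmetic is simple enough to sketch directly. In the Python below, the function names and the request counts are hypothetical; the only inputs are the SLO target and the measurement window.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for a time-based SLO."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of a request-based error budget still unspent.

    1.0 means untouched, 0.0 means exhausted, negative means overspent.
    """
    allowed_bad = (1.0 - slo) * total
    if allowed_bad == 0:
        return 0.0
    return 1.0 - (total - good) / allowed_bad


print(round(error_budget_minutes(0.999), 1))   # 43.2 minutes per 30 days
print(round(error_budget_minutes(0.9999), 2))  # 4.32 minutes per 30 days

# Hypothetical month: 1,000,000 requests, 600 of them failed.
# The budget allows about 1,000 failures, so roughly 40% is left.
print(round(budget_remaining(0.999, good=999_400, total=1_000_000), 2))
```

Each extra "nine" cuts the budget by a factor of ten, which is why tightening an SLO is a much bigger commitment than it looks on paper.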

16 Cardinality Pitfalls

Cardinality is the number of distinct time series a metric produces, which equals the number of unique combinations of its label values. A metric http_requests_total with a label code (a handful of values) is fine. Add a label whose values are nearly unique — a user ID, email, full URL with query string, or a raw timestamp — and the series count explodes.

This is a cardinality explosion: each unique label combination is a separate series the TSDB must store and index, so memory, disk and query time blow up, potentially crashing the metrics system. It is the most common way to accidentally take down Prometheus.

The rule: labels must be bounded and low-cardinality — use them for things with a small, stable set of values (status code, region, endpoint template). High-cardinality, per-request detail belongs in logs or traces, not metric labels.
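A quick calculation shows how fast the series count grows. This Python sketch multiplies the number of distinct values per label to get the worst case, where every combination actually occurs; the label names and value counts are made up for illustration.

```python
from math import prod
from typing import Dict, Sized


def max_series(label_values: Dict[str, Sized]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


bounded = {
    "code": ["200", "301", "404", "500", "503"],
    "region": ["eu", "us", "ap"],
    "endpoint": range(20),  # 20 route templates such as /orders/{id}
}
print(max_series(bounded))  # 5 * 3 * 20 = 300

# Adding one unbounded label multiplies everything that came before it.
exploded = dict(bounded, user_id=range(100_000))
print(max_series(exploded))  # 300 * 100000 = 30000000
```

Three hundred series is trivial for a TSDB; thirty million from a single metric is not, and that jump came from one extra label.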

17 Health Checks and Synthetic Monitoring

A health check is an endpoint (often /healthz or /readyz) that reports whether an instance is working. A liveness check answers "is the process alive, or should it be restarted?"; a readiness check answers "is it ready to receive traffic right now?" Load balancers and orchestrators such as Kubernetes poll these endpoints to decide whether to route traffic to an instance or, in the orchestrator's case, restart it.
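The difference between the two checks is easiest to see in code. The Python sketch below serves both endpoints with the standard library; mapping /healthz to liveness and /readyz to readiness, the AppState class and the serve helper are assumptions for illustration rather than a fixed convention.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple


class AppState:
    """Hypothetical flag the application flips as it starts up or degrades."""

    def __init__(self) -> None:
        self.ready = False  # set True once dependencies are reachable


STATE = AppState()


def health_response(path: str, state: AppState) -> Tuple[int, str]:
    """Return (HTTP status, body) for a health-check path."""
    if path == "/healthz":
        # Liveness: being able to answer at all means the process is alive.
        return 200, "ok"
    if path == "/readyz":
        # Readiness: alive, but only route traffic here when fully started.
        return (200, "ready") if state.ready else (503, "not ready")
    return 404, "not found"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, text = health_response(self.path, STATE)
        body = text.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Block forever, answering health checks on the given port."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```

During startup this instance reports live but not ready, so a poller would withhold traffic without restarting it; once STATE.ready is set, /readyz switches to 200.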

Synthetic monitoring goes further: instead of waiting for real users, a robot periodically performs scripted transactions against the system from outside (for example "log in, search, add to cart") to verify the whole path works. Because it runs continuously, it can detect outages even when no real user happens to be active, and it can probe from multiple geographic locations.

Synthetic checks complement real-user telemetry: synthetics catch problems early and consistently, while real-user monitoring captures the actual, diverse experience of your users.

18 Incident Detection vs Diagnosis

Responding to an incident has two distinct phases, and good observability serves both.

Detection is knowing that something is wrong, ideally before customers tell you. This is the job of alerting on symptoms and the golden signals: a sharp rise in errors or latency should page you fast. The relevant measure is time to detect, usually tracked as MTTD (mean time to detect).

Diagnosis is figuring out why it is wrong and where the fault lies, so you can fix it. This is where rich telemetry pays off: drill from a metric alert into traces to find the slow service, then into logs (joined by correlation ID) to see the exact error. Faster diagnosis directly reduces MTTR (mean time to resolve).

Designing for both matters: a system can make problems easy to detect (a clear alert) yet painful to diagnose (no traces, unstructured logs). Aim to minimise both detection and resolution time.
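Both measures are plain averages over past incidents, as the Python sketch below shows. The Incident type and the two sample incidents are hypothetical, and the sketch assumes MTTR is measured from the start of the fault; some teams measure it from detection instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    started: datetime   # when the fault actually began
    detected: datetime  # when an alert fired or a human noticed
    resolved: datetime  # when service was restored


def _mean(deltas: List[timedelta]) -> timedelta:
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: List[Incident]) -> timedelta:
    """Mean time to detect: fault start to detection."""
    return _mean([i.detected - i.started for i in incidents])


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time to resolve: fault start to resolution (an assumption here)."""
    return _mean([i.resolved - i.started for i in incidents])


incidents = [
    Incident(datetime(2026, 1, 5, 10, 0), datetime(2026, 1, 5, 10, 4),
             datetime(2026, 1, 5, 10, 40)),
    Incident(datetime(2026, 1, 9, 14, 0), datetime(2026, 1, 9, 14, 10),
             datetime(2026, 1, 9, 15, 20)),
]
print(mttd(incidents))  # 0:07:00
print(mttr(incidents))  # 1:00:00
```

In this made-up data, detection takes 7 minutes on average while resolution takes an hour, which suggests the alerts are adequate and diagnosis is where the time goes.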

19 Observability-Driven Development

Observability-driven development (ODD) treats telemetry as a first-class part of building software, not an afterthought bolted on once production breaks. Just as test-driven development asks "how will I know this is correct?", ODD asks "how will I know how this behaves in production?" before the code ships.

In practice this means: instrument new features with metrics, logs and traces as you write them; add the spans and structured fields you will want when debugging; and consider what questions on-call will need to answer at 3am. It pairs naturally with practices like testing in production and progressive delivery, where you ship behind a flag and watch real telemetry to validate behaviour.

The payoff is that when something surprising happens — an unknown-unknown — you already have the high-cardinality data to explore it, instead of discovering you are blind exactly when you need to see. Observability becomes a design requirement, owned by developers, rather than something operations adds later.
