DevOps Culture & Practices Beginner

What DevOps really is: breaking down silos, CALMS, the Three Ways, flow, feedback and continual learning.

20 lessons · 60 quiz questions

📚 Lessons & quizzes

Each lesson ends with its own short quiz. Answer them as you go — score 90% across all lessons to earn your certificate.

1 The wall between Dev and Ops

For decades, software organisations split into two camps with opposing incentives. Development (Dev) was rewarded for shipping change — new features, fast. Operations (Ops) was rewarded for keeping things stable — no outages, no surprises. Each pursued its own goal, and a metaphorical wall grew between them.

The result was the classic dysfunction: Dev would "throw code over the wall" to Ops at release time, often late on a Friday. When the release broke production, Dev blamed Ops for misconfiguring the servers, and Ops blamed Dev for shipping fragile code. Nobody owned the whole outcome, so problems festered in the gap.

This is the core problem DevOps was created to solve. When two groups are measured on conflicting goals and handed work over a wall, you get slow, painful, blame-filled releases. DevOps replaces the wall with shared ownership of delivering and running software.

2 A short history of DevOps

DevOps grew out of the Agile movement of the early 2000s, which had already shortened development cycles but stopped at the deployment boundary — Agile teams could build quickly, yet still handed releases to a separate, slow Ops process.

In 2008 Andrew Shafer and Patrick Debois discussed "Agile Infrastructure" at an Agile conference. In 2009 John Allspaw and Paul Hammond gave a famous talk, "10+ Deploys per Day: Dev and Ops Cooperation at Flickr", showing that frequent, safe deployment was possible when the two groups collaborated. Later that year Debois organised the first DevOpsDays in Ghent, Belgium — and the contraction "DevOps" came from its hashtag.

The ideas were popularised further by The Phoenix Project (2013), a novel by Gene Kim, Kevin Behr and George Spafford, and by The DevOps Handbook (2016). DevOps is therefore best understood as a cultural movement that extended Agile principles all the way through to operating software in production.

3 What DevOps is — and is not

DevOps is a culture and set of practices that unite software development and IT operations to deliver value to users faster and more reliably. At its heart it is about collaboration, shared ownership, automation and continuous improvement across the whole software lifecycle — from idea to running in production.

It is just as important to know what DevOps is not:

  • It is not a job title you can simply hire ("a DevOps") to replace your Ops team.
  • It is not a single tool or product you can buy — no tool makes you "DevOps".
  • It is not just automation scripts bolted onto a broken culture.
  • It is not only about technology; the cultural change is the hardest and most valuable part.

Buying a tool while keeping conflicting goals and the old wall in place is a common, expensive mistake. DevOps is a way of working, supported by — not replaced by — tools and automation.

4 The CALMS model

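CALMS is a popular framework for assessing whether an organisation is ready for, and living, DevOps. It grew out of the earlier CAMS acronym, usually credited to John Willis and Damon Edwards, to which Jez Humble added the L for Lean. The five pillars are: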

  • C — Culture. Shared responsibility, collaboration and trust instead of silos and blame.
  • A — Automation. Automate repetitive, error-prone work: builds, tests, deployments, infrastructure.
  • L — Lean. Maximise flow of value and eliminate waste; work in small batches.
  • M — Measurement. Measure outcomes and use data to improve; you cannot improve what you do not measure.
  • S — Sharing. Share knowledge, tools, feedback and responsibility openly across teams.

CALMS is a useful lens because it puts Culture first and treats Automation as just one of five equal pillars. A team strong on tooling but weak on culture and sharing is not really doing DevOps.

5 The Three Ways: an overview

The DevOps Handbook and The Phoenix Project organise the principles of DevOps into the Three Ways. They build on each other:

  1. The First Way — Flow. Optimise the speed and smoothness of work flowing left-to-right, from Development through Operations to the customer.
  2. The Second Way — Feedback. Create fast, constant feedback flowing right-to-left, so problems are seen and fixed quickly and quality is built in.
  3. The Third Way — Continual Learning and Experimentation. Build a culture that rewards experimentation, learning from failure, and repeating practice to achieve mastery.

Think of them as flow (get value out fast), feedback (catch problems fast), and learning (get better forever). The next lessons unpack each Way in turn. Together they describe a system that delivers quickly, fails safely, and improves continuously.

6 The First Way: Flow

The First Way is about flow — making work move smoothly and quickly from left to right, from Development, through Operations, to the customer. The unit of value is a working change in the hands of a user, and the goal is to shorten the time it takes to get there.

Practices that improve flow include making work visible (for example on a kanban board), reducing batch sizes so changes are small, reducing the number of handoffs, and never knowingly passing a known defect downstream. By limiting work-in-progress and eliminating waste, value reaches the customer faster and with fewer surprises.

A useful mental image: a well-designed factory line where each item flows steadily to completion, rather than piling up in front of an overloaded station. In software, the "items" are changes flowing toward production.

7 The Second Way: Feedback

The Second Way creates fast and continuous feedback flowing from right to left — from operations and customers back to development — at every stage of the value stream. The aim is to detect problems as early as possible, while they are cheap to fix, and to amplify the signal so the whole team can act on it.

Concrete feedback loops include automated tests that fail fast in the pipeline, monitoring and alerting from production, and the practice of swarming — stopping to fix a problem the moment it appears, rather than working around it. A core idea borrowed from Lean manufacturing is the andon cord: anyone can "pull the cord" to halt the line when they see a defect, so it never propagates.

Fast feedback is what makes fast flow safe. Without it, shipping quickly just means breaking things quickly. With it, problems surface near their source and quality is built in rather than inspected in at the end.

8 The Third Way: Continual Learning

The Third Way is about building a culture of continual learning and experimentation. It rests on two ideas: continuously experimenting (taking risks and learning from both success and failure) and understanding that mastery comes from repetition and practice.

Practically, this means allocating time to improve daily work — not just doing the work, but improving the work — and deliberately injecting controlled stress to build resilience, for example through game days or chaos engineering. It also means treating failures as learning opportunities rather than occasions for blame, and sharing the lessons widely so the whole organisation improves.

A memorable maxim from the Lean tradition, quoted in The DevOps Handbook from Lean IT author Mike Orzen, is that improving daily work is even more important than doing daily work. The Third Way turns a team from one that merely ships into one that gets measurably better over time.

9 Shift-left: quality and security earlier

Shift-left means moving activities that traditionally happened late in the lifecycle — testing, security, performance and operability concerns — to the left, that is, earlier in the process. The name comes from picturing the pipeline as a timeline running left (early) to right (late, in production).

The motivation is economic: a defect found while a developer is still typing the code costs almost nothing to fix; the same defect found in production can cost orders of magnitude more, plus reputational damage. By writing tests alongside code, scanning for vulnerabilities in the pipeline, and thinking about operations during design, problems are caught when they are cheapest.

"Shift-left security" gave rise to the term DevSecOps — building security in from the start rather than bolting it on at the end. The general principle is the same across testing, security and operability: do not wait until the end to think about quality.

10 The automation mindset

Automation is the engine that makes DevOps practical. The mindset is simple: if a task is repetitive, error-prone and frequent, a computer should do it, freeing humans for judgement and creative work. Builds, tests, deployments, environment provisioning and routine operational chores are all candidates.

Automation brings three big benefits. It is consistent — a script runs the same way every time, eliminating "it worked on my machine" drift. It is fast — machines do in seconds what people do in hours. And it is self-documenting — the automation code is the record of how the process actually works.

A useful rule of thumb is to automate the toil: the manual, repetitive operational work that scales with usage and brings no lasting value. But automation is a means, not the goal — automating a bad process just lets you make mistakes faster, so simplify first, then automate.

11 CI/CD at a high level

Continuous Integration (CI) is the practice of developers merging their work into a shared mainline frequently — at least daily — with an automated build and test suite running on every merge. The goal is to catch integration problems within minutes instead of discovering them weeks later in a painful "merge hell".

Continuous Delivery (CD) extends this: every change that passes the pipeline is kept in a releasable state, so the software can be deployed to production at any time at the push of a button. Continuous Deployment goes one step further and automatically releases every passing change to production with no manual gate.

Together, CI/CD forms the automated path from a code commit to running software — the deployment pipeline. It is the practical machinery behind the First Way’s fast flow and the Second Way’s fast feedback. Small, frequent, automatically tested changes are the heartbeat of DevOps delivery.

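#!/usr/bin/env bash
# A tiny conceptual CI pipeline. The make targets and the deploy
# command are placeholders for whatever your project actually uses.
set -euo pipefail                      # abort as soon as any stage fails

stage() { echo "== stage: $1 =="; }    # print a banner for each stage

stage "build"  && make build           # compile / package
stage "test"   && make test            # run automated tests
stage "scan"   && make security        # shift-left security checks
stage "deploy" && deploy --env=prod    # ship only if all stages pass

# Each commit triggers the whole pipeline.
# A red stage stops the line before bad code reaches production.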

12 Infrastructure as Code (concept)

Infrastructure as Code (IaC) is the practice of defining and managing infrastructure — servers, networks, load balancers, databases — using machine-readable definition files, rather than configuring it by hand through a console or by clicking around a web portal.

Because the infrastructure is described in code, it can be stored in version control, reviewed, tested and deployed exactly like application code. This brings several wins: environments become reproducible (the same file builds the same environment every time), consistent (no more snowflake servers that drift from one another), and auditable (every change is a commit with a history).

A key related idea is idempotency: applying the same definition repeatedly converges to the same desired state without causing harm. IaC is what lets teams treat infrastructure as cattle, not pets — disposable and rebuildable, rather than hand-nursed and irreplaceable.
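
Idempotency can be illustrated with a toy reconciler. The sketch below is not how any particular IaC tool works: the apply function and the dictionary standing in for "infrastructure" are hypothetical, chosen only to show that applying the same desired state twice changes something the first time and nothing the second.

```python
# Toy model of an idempotent "apply". All names here are illustrative
# assumptions, not the API of any real IaC tool.

def apply(desired: dict, actual: dict) -> list[str]:
    """Converge `actual` toward `desired` in place; return the changes made."""
    changes = []
    for name, spec in desired.items():
        if name not in actual:
            actual[name] = dict(spec)
            changes.append(f"create {name}")
        elif actual[name] != spec:
            actual[name] = dict(spec)
            changes.append(f"update {name}")
    for name in list(actual):
        if name not in desired:
            del actual[name]
            changes.append(f"delete {name}")
    return changes


desired = {
    "web-1": {"size": "small", "port": 443},
    "db-1": {"size": "large", "port": 5432},
}
# The real environment has drifted: wrong port, a leftover server, no database.
actual = {
    "web-1": {"size": "small", "port": 80},
    "old-1": {"size": "small", "port": 22},
}

first = apply(desired, actual)   # makes the changes needed to converge
second = apply(desired, actual)  # already converged, so no changes

print(first)
print(second)
```

The second call returns an empty list because the environment already matches the definition, which is the property that makes re-running a definition safe.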

13 Blameless postmortems

When an incident happens, the most valuable thing a team can do is learn from it. A postmortem (sometimes called a retrospective or incident review) is a written analysis of what happened, why, how it was resolved, and what will change to prevent recurrence.

The crucial adjective is blameless. A blameless postmortem assumes that everyone acted reasonably given the information they had at the time, and focuses on the system and process that allowed the failure — not on punishing the individual who pushed the button. The premise is that humans rarely fail through malice; they fail because the system made failure easy.

This matters because of incentives. If people are blamed and punished for incidents, they hide mistakes, and the organisation stops learning. When postmortems are blameless, people surface problems honestly, root causes get fixed, and the same outage does not happen twice. Blamelessness is not about avoiding accountability — it is about directing attention to fixing systems rather than scapegoating people.

14 Site Reliability Engineering (SRE) basics

Site Reliability Engineering (SRE) is a discipline that originated at Google and is often described as "what happens when you ask a software engineer to design an operations function". Where DevOps is a broad cultural philosophy, SRE is one concrete, prescriptive way of implementing many of its ideas.

SRE treats operations as a software problem. SREs use software engineering to automate operational tasks, manage reliability with explicit targets, and deliberately cap the amount of manual operational work — toil — they will tolerate (Google’s well-known guideline is to keep toil below about 50% of an SRE’s time, reserving the rest for engineering that reduces future toil).

A core SRE belief is that 100% reliability is the wrong target: it is impossibly expensive and users cannot even perceive it. Instead, SRE sets a realistic reliability goal and engineers to it — an idea formalised by SLIs, SLOs and error budgets, which the next lessons cover.

15 SLIs, SLOs and SLAs

SRE makes reliability measurable with three related terms:

  • SLI — Service Level Indicator. A measurement of some aspect of service behaviour, such as the percentage of requests served successfully, or the proportion served faster than 200 ms. It is the raw number.
  • SLO — Service Level Objective. A target for an SLI over a window of time — for example, "99.9% of requests succeed over 30 days". It is the goal you choose internally.
  • SLA — Service Level Agreement. A contract with customers that includes consequences (such as refunds or credits) if the agreed level is not met.

The relationship is a hierarchy: you measure an SLI, you aim for an SLO, and you promise an SLA. Internal SLOs are normally set stricter than the externally promised SLA, so that you notice and react to degradation before you ever breach a customer contract.
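
The relationship between the three terms can be sketched in a few lines. The request counts and both targets below are made-up numbers, and availability_sli is a hypothetical helper; the point is only that one measured SLI is compared against a stricter internal SLO and a looser external SLA.

```python
# Illustrative numbers only: the request counts and targets are assumptions.

def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully in the window."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return good_requests / total_requests


SLO_TARGET = 0.999  # internal objective (stricter)
SLA_TARGET = 0.995  # contractual promise to customers (looser)

sli = availability_sli(good_requests=998_500, total_requests=1_000_000)

slo_met = sli >= SLO_TARGET
sla_met = sli >= SLA_TARGET

print(f"SLI = {sli:.4%}")
print(f"SLO met: {slo_met}, SLA met: {sla_met}")
```

With these numbers the SLO is missed while the SLA still holds, which is exactly the early warning the stricter internal target is meant to give.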

16 Error budgets

An error budget is the flip side of an SLO. If your SLO says 99.9% of requests should succeed, then 0.1% are allowed to fail — that allowance is your error budget. It is, in effect, the amount of unreliability you are permitted to spend over a period before you breach the objective.

This reframes the old Dev-versus-Ops tension into a shared, data-driven agreement. As long as the budget is not exhausted, the team is free to spend it on shipping features and taking risks. If the budget runs out — too many failures this month — the policy is to slow down and prioritise reliability work until the service is healthy again.

Error budgets are powerful because they make the speed-versus-stability trade-off explicit and self-regulating. Nobody has to argue about whether to ship; the budget decides. It aligns Dev’s desire to move fast with Ops’ desire to stay stable, using one shared number.
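
The arithmetic behind an error budget is simple enough to sketch. The traffic figures, the function names and the "freeze releases when the budget is gone" rule below are illustrative assumptions; real policies vary from team to team.

```python
# Sketch of error-budget arithmetic. The SLO, traffic figures and the
# freeze policy are illustrative assumptions.

def error_budget(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Return how much of the error budget has been spent in this window."""
    allowed = round((1 - slo) * total_requests)  # failures the SLO permits
    remaining = allowed - failed_requests
    spent = failed_requests / allowed if allowed else float("inf")
    return {
        "allowed_failures": allowed,
        "remaining_failures": remaining,
        "budget_spent": spent,
        "freeze_releases": remaining <= 0,
    }


def allowed_downtime_minutes(slo: float, window_days: int) -> float:
    """Time-based view of the same budget: minutes of full outage permitted."""
    return (1 - slo) * window_days * 24 * 60


healthy = error_budget(slo=0.999, total_requests=1_000_000, failed_requests=400)
exhausted = error_budget(slo=0.999, total_requests=1_000_000, failed_requests=1_200)

print(healthy)    # 400 of 1000 allowed failures used: keep shipping
print(exhausted)  # 1200 of 1000 used: prioritise reliability work
print(round(allowed_downtime_minutes(0.999, 30), 1))  # 43.2 minutes per 30 days
```

Expressed as time, a 99.9% objective over 30 days leaves roughly 43 minutes of total outage to spend.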

17 The DORA four key metrics

The DORA (DevOps Research and Assessment) team studied thousands of organisations and identified four key metrics that distinguish high-performing software teams. They split neatly into two pairs — two about throughput (speed) and two about stability:

  • Deployment Frequency — how often you deploy to production. (Throughput)
  • Lead Time for Changes — how long from a commit to that change running in production. (Throughput)
  • Change Failure Rate — what percentage of deployments cause a failure needing remediation. (Stability)
  • Mean Time to Recover (MTTR) — how long it takes to restore service after a failure. (Stability)

The crucial DORA finding is that speed and stability are not a trade-off: elite teams score well on all four at once. Note that the R in MTTR here stands for recover, meaning the time to restore service, not the time to repair the root cause.
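
As a rough sketch, the four metrics can be computed from a list of deployment records. The Deployment record and its field names are hypothetical, the sample data is invented, and two simplifications are assumed: lead time is summarised with a median, and recovery time is measured from the failed deployment to the moment service was restored.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean, median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None


def dora_metrics(deployments: list[Deployment], window_days: int) -> dict:
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_hours = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deployments
    ]
    failures = [d for d in deployments if d.failed]
    recovery_minutes = [
        (d.restored_at - d.deployed_at).total_seconds() / 60
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deployment_frequency_per_day": len(deployments) / window_days,
        "median_lead_time_hours": median(lead_hours),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_recover_minutes": mean(recovery_minutes) if recovery_minutes else None,
    }


def at(day: int, hour: int, minute: int = 0) -> datetime:
    return datetime(2024, 1, day, hour, minute)


deployments = [
    Deployment(at(1, 9), at(1, 11)),
    Deployment(at(1, 13), at(1, 14), failed=True, restored_at=at(1, 14, 30)),
    Deployment(at(2, 9), at(2, 12)),
    Deployment(at(2, 15), at(2, 19), failed=True, restored_at=at(2, 20, 30)),
]

metrics = dora_metrics(deployments, window_days=2)
print(metrics)
```

For this sample the team deploys twice a day with a median lead time of 2.5 hours, half of its changes fail, and service is restored in 60 minutes on average.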

18 Value-stream mapping

A value stream is the full sequence of steps required to turn an idea into value delivered to a customer — in software, everything from "we have a request" to "it is running in production and being used". Value-stream mapping is a Lean technique for drawing that sequence end-to-end to see where time is really spent.

For each step you record two numbers: process time (how long the work actually takes) and lead time (how long it takes including waiting in queues). The eye-opening result is almost always that work spends far more time waiting — in approval queues, handoffs and "blocked" states — than being actively worked on.

The ratio of process time to total lead time is called flow efficiency, and it is often shockingly low. Mapping the value stream reveals the biggest bottlenecks and queues, so improvement effort is aimed where it actually shortens delivery — usually by removing waiting, not by making people type faster.
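
A worked example makes the ratio concrete. The step names and hour figures below are invented for illustration; each step records the two numbers described above, process time and lead time, and waiting is taken as the difference between them.

```python
# Hypothetical value stream: (step, process_time_hours, lead_time_hours).
# All figures are invented for illustration.
value_stream = [
    ("Develop", 6.0, 8.0),
    ("Code review", 1.0, 24.0),
    ("QA", 3.0, 48.0),
    ("Change approval", 0.5, 72.0),
    ("Deploy", 0.5, 8.0),
]

total_process = sum(process for _, process, _ in value_stream)
total_lead = sum(lead for _, _, lead in value_stream)

# Flow efficiency: share of total lead time spent on active work.
flow_efficiency = total_process / total_lead

# The step with the most waiting is the first place to look for improvement.
bottleneck = max(value_stream, key=lambda step: step[2] - step[1])

print(f"process time: {total_process} h, lead time: {total_lead} h")
print(f"flow efficiency: {flow_efficiency:.1%}")
print(f"longest wait: {bottleneck[0]} ({bottleneck[2] - bottleneck[1]} h waiting)")
```

Here 11 hours of actual work take 160 hours to deliver, a flow efficiency of about 6.9%, and the approval queue alone accounts for 71.5 hours of waiting.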

19 Small batch sizes and WIP limits

Two of the most powerful Lean ideas in DevOps are working in small batches and limiting work-in-progress (WIP).

Small batches: instead of accumulating a huge release of many changes, you ship small changes frequently. Small batches are easier to test, easier to reason about, and dramatically easier to debug — when something breaks after a one-line deploy, the cause is obvious; when it breaks after a 500-change quarterly release, finding the culprit is a nightmare. Small batches also deliver value sooner and create faster feedback.

WIP limits: a team caps how many items it works on at once. This sounds counter-intuitive — surely doing more things finishes more? — but starting everything at once just creates queues and context-switching, so nothing finishes. The Lean insight is to stop starting and start finishing: limiting WIP exposes bottlenecks and actually increases the rate at which work is completed.
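
One way to picture a WIP limit is as a board column that refuses new work once it is full. The KanbanColumn class, its limit of two and the task names are hypothetical; the sketch only shows the "stop starting, start finishing" rule being enforced.

```python
class WipLimitExceeded(Exception):
    """Raised when starting new work would exceed the column's WIP limit."""


class KanbanColumn:
    """Hypothetical board column that enforces a work-in-progress limit."""

    def __init__(self, name: str, wip_limit: int) -> None:
        self.name = name
        self.wip_limit = wip_limit
        self.items: list[str] = []

    def start(self, item: str) -> None:
        if len(self.items) >= self.wip_limit:
            raise WipLimitExceeded(
                f"{self.name} is full ({self.wip_limit}); finish something first"
            )
        self.items.append(item)

    def finish(self, item: str) -> None:
        self.items.remove(item)


in_progress = KanbanColumn("In progress", wip_limit=2)
in_progress.start("fix login bug")
in_progress.start("add audit log")

try:
    in_progress.start("upgrade database")  # a third item breaks the limit
    blocked = False
except WipLimitExceeded:
    blocked = True  # the limit forces the team to finish before starting

in_progress.finish("fix login bug")       # finishing frees a slot
in_progress.start("upgrade database")     # now the new item can start

print(blocked, in_progress.items)
```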

20 Psychological safety and collaboration

Underneath every DevOps practice lies a cultural foundation: psychological safety. As defined by researcher Amy Edmondson, whose work popularised the term, it is the shared belief that a team is safe for interpersonal risk-taking: that you can ask a question, admit a mistake, raise a concern or challenge a decision without fear of humiliation or punishment.

Google’s well-known Project Aristotle study of what makes teams effective found psychological safety to be the single most important factor — more important than who was on the team. It is also the prerequisite for nearly everything else in this course: blameless postmortems only work if people feel safe to be honest; fast feedback only helps if people are willing to surface bad news; continual learning only happens where failure is safe to discuss.

Collaboration, shared ownership and trust are not "soft" extras layered on top of DevOps — they are the substrate that makes the technical practices function. Tools and pipelines amplify a healthy culture; they cannot create one.
