Updated 2026-05-21

DevOps Engineer Interview Questions — Complete 2026 Guide

You opened this guide because a DevOps engineer interview is on the calendar and the standard “top 50 questions” lists are not cutting it. This guide is written for working DevOps, SRE, and platform engineers with two to ten years of experience targeting mid-level through staff roles at scaleups, enterprises, and remote-first shops. By the end you will know the actual shape of a 2026 DevOps loop, the scenario questions that show up in every round, the signals hiring managers score behind the rubric, and the mistakes that turn solid engineers into polite rejections. No tool name-dropping, no AI-generated filler, no recycled “explain DevOps in three sentences.”

The DevOps interview funnel

A 2026 DevOps loop usually runs four to six rounds over two to three weeks. The shape has shifted in the last eighteen months because so many roles titled DevOps Engineer are now functionally platform engineering or SRE positions. Gartner has forecast that 80 percent of large software engineering organizations will have a platform engineering team by 2026, and the CNCF 2025 annual survey put Kubernetes production adoption at 82 percent. That combination means almost every loop will touch Kubernetes, GitOps, and internal developer platforms regardless of what the title says.

Typical structure:

  1. Recruiter screen, 30 minutes. Compensation, work authorization, why DevOps, on-call comfort.
  2. Technical phone screen, 45 to 60 minutes. Linux fundamentals, scripting (Bash or Python), one infrastructure scenario.
  3. Take-home or live exercise, 1 to 4 hours. Fix a broken Terraform module, write a GitHub Actions workflow, or debug a failing Helm chart.
  4. Onsite or virtual onsite, 3 to 5 rounds: CI/CD and IaC deep dive, Kubernetes and containers, observability and incident response, behavioral, hiring manager close.
  5. Optional bar-raiser or cross-functional panel with a senior developer or SRE.

Time-to-hire for infrastructure roles in U.S. tech sits around 38 to 45 days. Mid-funnel drop is high because companies stall on the technical loop while waiting for the right panel. Plan your prep window at six weeks and keep two or three parallel processes alive so a single slow recruiter does not blow your timeline.

Top behavioral questions

Behavioral rounds in a DevOps loop are not soft warm-ups. They are where postmortem culture, on-call discipline, and blameless thinking get scored. Use STAR, keep answers to 90 to 120 seconds, and quantify outcomes (MTTR, error rate, deploy frequency, page volume).

  • “Tell me about an outage you caused or contributed to.” Pick a real one. Walk through what changed, detection time, mitigation, the postmortem you wrote, and the action items you actually closed. Saying “we made it blameless” is not enough; show the specific process or tool change.
  • “Describe an on-call rotation that was unsustainable.” The 2025 SRE Report from Catchpoint found engineers spend a median 30 percent of their week on operational work, up from 25 percent the year before, and 46 percent of respondents had handled more than five incidents in the last 30 days. Interviewers want to hear you noticed the load, named it, and pushed for structural change (alert reduction, runbook automation, rotation expansion), not that you toughed it out.
  • “When did you push back on a developer or product manager?” Show you escalated with data, proposed a path forward, and preserved the relationship.
  • “Walk me through a postmortem you led.” Bonus points for naming a contributing factor that was systemic (capacity planning, missing alert, runbook gap) rather than human.
  • “How do you handle alert fatigue?” Concrete answer: audit the ten noisiest alerts, delete or tune the ones that rarely lead to action, and route the rest by actionability (page only when someone must act now). Generic answers lose.

The pattern across all five: specificity beats philosophy. A peer engineer telling a story about one bad week outscores a polished candidate quoting Google’s SRE book.

CI/CD, infrastructure-as-code, containers

This is the longest round in most loops and the one with the widest skill spread between candidates. Hiring managers are checking whether you can design a pipeline, write idempotent infrastructure, and reason about container internals — not whether you have memorized flags.

CI/CD. Expect a design question like “walk me through a pipeline that takes a microservice from commit to production with safe rollback.” A strong answer covers: trunk-based development, fast feedback loops (under ten minutes for unit and integration), artifact promotion across environments rather than rebuilding, signed images, and automatic rollback on SLO regression. Mention blue-green or canary deployment explicitly. If the company runs progressive delivery with Argo Rollouts or Flagger, name it.
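To make “automatic rollback on SLO regression” concrete on a whiteboard, a minimal sketch of the decision logic can help. The names (`WindowStats`, `canary_decision`) and every threshold below are hypothetical, chosen only for illustration; tools such as Argo Rollouts or Flagger express the same idea as analysis configuration rather than hand-written code.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request and error counts for one deployment track over the same time window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Avoid division by zero when a track has received no traffic yet.
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(baseline: WindowStats, canary: WindowStats,
                    slo_error_rate: float = 0.001,
                    max_ratio: float = 2.0,
                    min_requests: int = 500) -> str:
    """Return 'wait', 'rollback', or 'promote'. All thresholds are illustrative assumptions."""
    if canary.requests < min_requests:
        return "wait"  # not enough canary traffic to judge
    breaches_slo = canary.error_rate > slo_error_rate
    worse_than_baseline = canary.error_rate > baseline.error_rate * max_ratio
    # Roll back only when the canary both violates the SLO and is clearly
    # worse than the stable version serving the same traffic mix.
    if breaches_slo and worse_than_baseline:
        return "rollback"
    return "promote"
```

Comparing against the baseline as well as the SLO is a design choice worth saying out loud in the interview: it keeps a shared upstream failure from being blamed on the new release.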

Terraform and IaC. Core topics: remote state with locking, module structure (composition vs monolithic root modules), drift detection, plan-then-apply gating in CI, and how you handle a partially-applied state after a failure. The differentiator in 2026 is GitOps integration. The CNCF 2025 end-user survey reported nearly 60 percent of surveyed Kubernetes clusters now rely on Argo CD, with 97 percent of Argo CD users running it in production. Expect at least one question on how you reconcile Terraform-managed cloud resources with Argo CD-managed Kubernetes manifests without overlapping ownership.
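One common way to implement drift detection is a scheduled `terraform plan -detailed-exitcode`, which exits 0 when the plan is empty, 1 on error, and 2 when changes are pending. The sketch below wraps that convention; the function names and status labels are hypothetical, and it assumes `terraform` is on the PATH and `terraform init` has already run in the working directory.

```python
import subprocess


def classify_plan_exit(code: int) -> str:
    """Map a `terraform plan -detailed-exitcode` exit code to a status label."""
    if code == 0:
        return "in-sync"  # plan succeeded and found nothing to change
    if code == 2:
        return "drift"    # plan succeeded and changes are pending
    return "error"        # 1 (or anything unexpected): the plan itself failed


def check_drift(workdir: str) -> str:
    """Run a plan in `workdir` and classify the result (assumes an initialized directory)."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

A non-empty plan only indicates drift when the checked-out configuration matches what was last applied, so a job like this is usually pointed at the main branch rather than at open pull requests.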

Containers and Kubernetes. Scenario questions dominate: “a pod is stuck in CrashLoopBackOff, what do you check”; “explain the difference between requests and limits and what happens when you set limits but not requests”; “design an immutable image pipeline with vulnerability scanning.” Be ready to talk about probes (readiness vs liveness vs startup), pod disruption budgets, and why you would or would not use a sidecar. Immutable infrastructure should be a default assumption, not a buzzword you trot out.
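The requests-versus-limits question is easier to answer with the QoS rules in mind: when a limit is set without a request, Kubernetes defaults the request to the limit, and the pod's QoS class follows from the resulting values. The sketch below models that with plain dictionaries shaped like a container spec. It is a simplification: it compares quantities as strings (so "1" and "1000m" would not match), ignores init containers, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

RESOURCES = ("cpu", "memory")  # only these two resources determine the QoS class


def effective_resources(container: dict) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (requests, limits) after defaulting a missing request to its limit."""
    res = container.get("resources") or {}
    limits = {k: v for k, v in (res.get("limits") or {}).items() if k in RESOURCES}
    requests = {k: v for k, v in (res.get("requests") or {}).items() if k in RESOURCES}
    for name, value in limits.items():
        requests.setdefault(name, value)  # limit set, request omitted: request = limit
    return requests, limits


def qos_class(containers: List[dict]) -> str:
    """Classify a pod as Guaranteed, Burstable, or BestEffort from its containers."""
    effective = [effective_resources(c) for c in containers]
    if all(not req and not lim for req, lim in effective):
        return "BestEffort"  # nothing set anywhere: first to be evicted under pressure
    guaranteed = all(
        name in lim and req.get(name) == lim[name]
        for req, lim in effective
        for name in RESOURCES
    )
    return "Guaranteed" if guaranteed else "Burstable"
```

For example, a single container with only CPU and memory limits comes out as Guaranteed, which is the answer interviewers are usually fishing for.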

If you have never run any of this in production, build a small cluster on kind or k3d, deploy a real app, break it, fix it, and write the postmortem. Hands-on stories beat certifications every time.

Observability, monitoring, incident response

This round separates engineers who absorb pain from engineers who reduce it. The framing is almost always a scenario: “your top customer is reporting intermittent 500s, walk me through your first thirty minutes.”

Vocabulary you need fluently: SLI (the measurement), SLO (the target), SLA (the contract), error budget (how much unreliability you can spend before freezing releases), burn rate alerting (page when the error budget is being consumed too fast over both a long and a short window, rather than on a raw threshold), and the difference between symptom-based and cause-based alerts. If you do not naturally say “we alerted on the symptom (user-facing latency) not the cause (one bad node),” that is the gap to close.
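The arithmetic behind error budgets and burn rate is short enough to work through on a whiteboard. The sketch below assumes a 30-day SLO window and uses 14.4 as the paging threshold, which corresponds to spending 2 percent of a 30-day budget in one hour (0.02 × 720 hours); the function names are hypothetical. For a 99.9 percent SLO the budget works out to roughly 43 minutes of full unavailability per 30 days.

```python
def error_budget(slo: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo


def budget_minutes(slo: float, days: int = 30) -> float:
    """Error budget expressed as minutes of total outage over the SLO window."""
    return days * 24 * 60 * error_budget(slo)


def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return error_rate / error_budget(slo)


def should_page(long_window_error_rate: float,
                short_window_error_rate: float,
                slo: float,
                threshold: float = 14.4) -> bool:
    """Multi-window check: page only if both windows exceed the burn-rate threshold.

    The long window (for example 1 hour) shows the burn is significant; the
    short window (for example 5 minutes) shows it is still happening.
    """
    return (burn_rate(long_window_error_rate, slo) >= threshold
            and burn_rate(short_window_error_rate, slo) >= threshold)
```

Requiring the short window as well is what stops the page from firing after the problem has already recovered.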

Alert fatigue questions are now standard. Industry data shows on-call teams routinely receive over 2,000 alerts per week with only about three percent requiring immediate action. Datadog and incident.io reports from 2025 both rank alert noise as the top contributor to on-call burnout. A strong answer: regularly audit the noisiest alerts, delete or tune anything that fires more than a few times a quarter without action, route the rest to chat instead of pages when severity does not warrant interruption. The Google SRE Workbook recommendation of a maximum of two to three actionable incidents per shift is a fair anchor to cite.
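The audit described above is simple enough to sketch. The example below assumes you can export alert history as (alert name, was actioned) pairs; that input shape, the function names, and the 10 percent cut-off are all hypothetical and would need adapting to whatever your paging tool actually exports.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def audit_alerts(events: Iterable[Tuple[str, bool]], top_n: int = 10) -> List[Dict]:
    """Rank the noisiest alerts and report how often each one led to action."""
    fired: Counter = Counter()
    actioned: Counter = Counter()
    for name, was_actioned in events:
        fired[name] += 1
        if was_actioned:
            actioned[name] += 1
    return [
        {"alert": name, "fired": count, "actioned_ratio": actioned[name] / count}
        for name, count in fired.most_common(top_n)
    ]


def triage(row: Dict, min_ratio: float = 0.1) -> str:
    """Suggest a next step; the cut-off is an illustrative assumption."""
    if row["actioned_ratio"] >= min_ratio:
        return "keep as page"
    return "tune, demote to chat, or delete"
```

Bringing even a rough table like this to an interview answer shows the audit was driven by data rather than by whoever complained loudest.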

Runbook questions: “what makes a runbook actually useful at 3 a.m.” Answer: short, specific, copy-pasteable commands, decision trees for the common failure modes, and a link to the dashboard with the metric that triggered the page. If the runbook is a wall of prose, it is decoration.

Bring up chaos engineering at least once. Game days, fault injection with tools like Litmus or Gremlin, and load shedding are differentiators that signal you do not wait for production to surface the next failure mode.

What hiring managers look for

Beneath the rubric, every DevOps hiring manager is scoring the same underlying tension: reliability versus velocity. The strongest candidates show they can hold both at once.

Concrete signals that move the needle:

  • Reduces toil rather than absorbing it. The candidate who automated a runbook step or killed a recurring alert outscores the candidate who proudly handled fifty pages last quarter.
  • Thinks in error budgets. Saying “we paused feature work for two weeks because we burned our SLO” demonstrates real prioritization in front of product pressure.
  • Picks incremental fixes over rewrites. Proposing to rip out Jenkins for GitHub Actions on day one is a flag. Proposing to migrate one pipeline first, measure, then expand is a hire signal.
  • Owns the developer experience. Internal platforms are mainstream in 2026 (CNCF puts cloud-native developers near twenty million). Hiring managers want engineers who treat the dev team as customers, not adversaries.
  • Communicates blameless cause-and-effect. In any postmortem or scenario question, the candidate who separates contributing factors from blame is the candidate who can be trusted in a war room.

What loses: name-dropping tools without operational stories, advocating for shiny migrations, dismissing legacy systems, and answering reliability questions with uptime percentages instead of MTTR or error budget burn. Hiring managers have heard every variation of “we hit five nines.” They want to know what happened the time you did not.

Questions to ask them

The end-of-loop question slot is a scoring opportunity, not a courtesy. Three questions that consistently impress DevOps hiring managers:

  1. “How big is the on-call rotation, and what is the median page volume per engineer per week?” This signals you take operational load seriously. If the rotation is under four people or pages exceed ten per week per engineer, that is a real flag worth raising before you sign.
  2. “Walk me through the last severity-one incident. What changed in the system or process afterward?” A team with a healthy postmortem culture will answer this easily. A team that struggles to remember or that blames a person is telling you something about how the next one will go.
  3. “Where is the friction between developer self-service and platform team approval?” Every internal platform has this tension. The honest answer reveals how mature the platform team is and how much political work the role will involve.

Two more if you have time: ask how the team measures toil reduction quarter over quarter, and ask which tools the team has actively removed in the last year. The second is a fast read on whether the team prunes its stack or accretes complexity.

Avoid asking about compensation, remote policy, or vacation in this slot. Those go to the recruiter. The end-of-loop question is for surfacing real operational signal.

Common mistakes

Patterns that show up in no-hire feedback over and over:

  • Tool theater. Reciting “we use Terraform, Argo CD, Prometheus, Grafana, Datadog, PagerDuty” without any story behind each. Pick the two you know coldest and have a real anecdote for each.
  • Heroic on-call stories. Bragging about working a 36-hour incident reads as a process failure, not a strength. Reframe around what you changed so the next incident took 30 minutes.
  • Blame slippage. Saying “the developer pushed bad code” in a postmortem question, even casually. The blameless framing is non-negotiable in 2026.
  • Big-bang migration proposals. Suggesting you would rewrite the CI system or rip out Jenkins on day one. Senior interviewers immediately mark you as someone who has not lived through a migration.
  • SLO confusion. Mixing up SLI, SLO, SLA, or treating uptime percentage as a meaningful target without latency, error rate, or saturation context.
  • Ignoring developer experience. Talking about reliability and security as if developers are obstacles instead of customers. Platform engineering questions catch this fast.
  • Faking depth. Claiming Kubernetes operator experience and then fumbling the controller-reconciliation loop. Interviewers probe two layers deep on any tool you list; honesty about which ones you have only read about is always the better play.

Walk in with two strong incident stories, one clean Terraform refactor or pipeline redesign story, and a clear point of view on alert fatigue. That base, plus comfort with the scenario format, beats memorizing two hundred trivia questions every time.

Frequently asked questions

How long is a typical DevOps engineer interview loop in 2026?

Four to six rounds spread across two to three weeks. Recruiter screen (30 min), a technical phone screen on Linux or scripting (45 to 60 min), then 3 to 5 onsite-style rounds: one CI/CD or infrastructure-as-code deep dive, one Kubernetes or container round, one observability or incident-response scenario, and at least one behavioral. Many shops now add a take-home that asks you to fix a broken Terraform module or write a small operator.

Is DevOps still a separate role from platform engineering and SRE in 2026?

In job titles, yes. In day-to-day work, often no. A significant share of roles posted as DevOps Engineer in 2026 are functionally platform engineering positions: building internal developer platforms (IDPs), golden paths, and self-service tooling on top of Kubernetes. Gartner has forecast that 80 percent of large software engineering organizations will have a platform engineering team by 2026, so expect interview questions about Backstage, IDPs, and Crossplane even when the title says DevOps.

What is the non-negotiable tooling stack for a DevOps interview?

Git, Docker, Kubernetes, Terraform, one CI/CD tool (GitHub Actions or Jenkins or GitLab CI), one cloud (AWS, Azure, or GCP), and one observability stack (Prometheus and Grafana, or Datadog). If a posting lists all seven, every round will touch at least two of them. Pick the cloud and CI tool the company actually runs in production and go deep there rather than spreading thin.

How heavily is Kubernetes tested at the senior level?

Heavily, and usually in scenario form, not trivia. Expect questions like “a pod is stuck in CrashLoopBackOff, walk me through your debugging path” or “a node went NotReady at 3 a.m., what do you check.” Memorizing kubectl flags will not save you. Knowing how to read events, describe a pod, check kubelet logs, and explain readiness vs liveness probes will.

What is the difference between SRE and DevOps in interviews?

SRE interviews lean harder on math (error budgets, percentile latency, queueing theory) and incident response. DevOps interviews lean harder on automation, pipelines, and developer experience. The overlap is huge, especially at companies under 500 engineers where one team owns both. If the role calls itself DevOps but the JD lists SLOs and burn rate alerting, prep both sides.

How do I answer the “tell me about an outage you caused” question?

Be specific, blameless toward others, and own the systemic gap. State what changed, when it broke, how detection worked or did not, the time to mitigate, and what you changed in process or tooling so it cannot recur. The interviewer is scoring whether you write blameless postmortems and whether you actually closed the loop on action items, not whether you have ever broken production.

Will AI assistants like Copilot be allowed during the interview?

For live whiteboard or pair-debugging rounds, usually no. For take-home tasks, often yes, but only if you disclose. Ask the recruiter in writing before the onsite. If the company runs an AI SRE tool internally (incident triage, runbook automation), expect at least one question on how you would evaluate and integrate it without creating new alert noise.

What metric do hiring managers actually care about most?

Mean time to recovery (MTTR), not mean time between failures. Modern services fail; the question is how fast a small team can detect, mitigate, and write a postmortem that changes something. Candidates who talk about MTTR, error budgets, and reducing toil consistently outscore candidates who talk about uptime percentages.
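If you plan to quote MTTR in an answer, it helps to know exactly how you computed it. A minimal sketch, assuming each incident is recorded as a (detected at, resolved at) pair; the type alias and function names are hypothetical. It reports the median alongside the mean because a single long incident can dominate the mean.

```python
from datetime import datetime
from statistics import mean, median
from typing import Dict, List, Tuple

Incident = Tuple[datetime, datetime]  # (detected_at, resolved_at)


def recovery_minutes(incidents: List[Incident]) -> List[float]:
    """Minutes from detection to resolution for each incident."""
    return [(resolved - detected).total_seconds() / 60
            for detected, resolved in incidents]


def mttr_summary(incidents: List[Incident]) -> Dict[str, float]:
    """Mean and median time to recovery, in minutes."""
    minutes = recovery_minutes(incidents)
    return {"mean_minutes": mean(minutes), "median_minutes": median(minutes)}
```

Be ready to say where the clock starts: measuring from detection, as here, hides slow detection, so some teams measure from the start of customer impact instead.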

How important is Terraform versus Pulumi or CDK in 2026?

Terraform is still the dominant answer in interviews. HCL knowledge, state management, remote backends, module structure, and drift detection come up in almost every DevOps loop. Pulumi and CDK come up at companies that already use them. Be honest about which one you have run in production; faking depth on a tool the interviewer uses daily is the fastest way to fail.

What separates a hire from a no-hire in a DevOps loop?

Three signals: clear reasoning under pressure during an incident scenario, a real preference for reliability over heroics, and concrete examples of reducing toil rather than absorbing it. No-hire votes usually come from candidates who name-drop tools without operational stories, propose rewrites instead of incremental fixes, or get defensive in the postmortem question.

How should I prep if I have never run Kubernetes in production?

Stand up a small cluster on kind or k3d, deploy a real app with Helm, break it on purpose, and write the postmortem. Read the official Kubernetes troubleshooting docs end to end. Run through the CKAD or CKA exam objectives even without taking the test. Six weekends of hands-on work outperforms six months of reading.

What questions should I ask the hiring manager at the end?

Ask about on-call rotation size, page volume per week, and how the team measures toil. Ask how the last severe incident was handled and what changed afterward. Ask about the gap between developer self-service and platform team approval. These three questions signal you think about operational reality, not just the tech stack on the JD.