Updated 2026-05-21

Data Engineer Interview Questions — Complete 2026 Guide

Data engineer interviews in 2026 are no longer just about whether you can write a SQL query or stand up an Airflow DAG. Hiring managers at Databricks-shop startups, Snowflake-native scale-ups, and FAANG platform teams are screening for a tighter bundle: SQL fluency under time pressure, distributed-systems intuition that holds up when a Spark job spills to disk, and the judgment to know when a pipeline is too expensive to run nightly. This guide walks through the loop end-to-end, the data engineer interview questions that come up at every stage, and the patterns that separate offers from polite rejection emails.

The Data Engineer interview funnel

Most data engineer loops in 2026 follow the same shape. The standard funnel runs: recruiter screen (20–30 minutes, motivation and salary calibration), hiring-manager call (45 minutes, project deep-dive plus light technical probes), technical screen (60–90 minutes, almost always SQL plus a Python or PySpark coding round), and a virtual onsite of 3 to 5 rounds. Senior loops add a system design round focused on pipeline architecture and a reliability or on-call conversation. According to a recent Dataquest summary of 2026 interview trends, scenario-based system design and trade-off reasoning have overtaken algorithmic puzzles as the dominant signal interviewers chase.

The onsite splits into four predictable buckets. The SQL and modeling round tests window functions, slowly-changing dimensions, and partition pruning against a fictional fact-and-dimension schema. The pipeline and infrastructure round asks about Airflow or Dagster DAGs, Spark tuning, Kafka throughput, or dbt project layout. The system design round zooms out: design an ETL platform for clickstream data, a streaming pipeline for fraud detection, or a lakehouse for a multi-tenant SaaS. A behavioral round closes the loop, and senior candidates almost always get a reliability conversation about incidents, SLOs, and cost.

Loop length varies. Most candidates report 2 to 4 weeks from recruiter screen to offer. Take-home assignments are still common at startups — usually a 4 to 8 hour ETL or dbt task. Contract-to-hire roles (3 to 12 months) have become more common as companies accelerate cloud migrations, and those compress to two technical rounds and a reference check.

Top behavioral questions

Behavioral rounds for data engineers focus less on leadership theater and more on operational scars. Interviewers want to hear about pipelines that broke, schemas that fought back, and bad data that escaped into a dashboard the CEO was watching. The strongest candidates lead with a measurable outcome — minutes-to-recovery, rows reconciled, dollars saved — not the tech stack.

Expect questions like:

  • “Tell me about a pipeline failure you owned end to end.” Walk through the incident: how the alert fired, the first hypothesis, what was wrong with it, the actual root cause, the fix, and the prevention. Naming a specific tool (Datadog alert, Sentry trace, dbt test failure) lands better than “the pipeline went down.”
  • “Describe a schema migration that went wrong.” Hiring managers want backward compatibility, dual writes, shadow reads, and the rollback plan that did or did not exist. Mention how downstream consumers were notified — a Slack ping does not count.
  • “How do you handle bad data that has already reached production?” Expected answer: quarantine, backfill, communication to stakeholders, post-incident review. “I added a dbt test” without naming the contract or alerting destination signals shallow ownership.
  • “Walk me through a time you disagreed with a data scientist or analyst about a model.” Strong answers reference lineage, source of truth, and where the transformation belongs (raw, staging, mart).

Behavioral rounds are graded on specificity and ownership — the more you can quote actual numbers, alert thresholds, and post-mortem outcomes, the higher you score.

SQL, modeling, warehousing questions

SQL is still the gateway round, and warehouse-native modeling is where most candidates get exposed. In 2026, interviewers assume you can write a SELECT with JOIN and GROUP BY without thinking. The bar is window functions, modeling for change over time, and physical layout decisions that affect cost.

Expect questions like:

  • Window functions. “Compute a 7-day rolling average of orders per user, plus the rank of each user by total spend within their signup cohort.” Practice combining ROW_NUMBER, RANK, LAG, and SUM(...) OVER (PARTITION BY ...) in a single CTE, and be ready to explain RANK versus DENSE_RANK on ties.
  • Slowly-changing dimensions. Describe Type 1 (overwrite), Type 2 (new row with effective dates), and Type 6 (hybrid) on a customer dimension. Bonus points for naming valid_from, valid_to, is_current, and the surrogate-key pattern.
  • Idempotent upsert. “Write a MERGE that lets the same job run 10 times without duplicating rows.” The canonical answer uses a deterministic primary key plus a deduplication CTE that keeps only rows where ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) = 1, applied before the MERGE.
  • Partition pruning. Given a fact table partitioned on event_date, why does WHERE DATE_TRUNC('day', event_ts) = '2026-05-01' scan the whole table while WHERE event_date = '2026-05-01' scans one partition? The first predicate filters on an expression over event_ts, which is not the partition column, so the planner generally cannot map it to partitions. Wrapping the partition column itself in a function usually defeats pruning in the same way.
  • Late-arriving data. Events that show up two days after the partition was sealed — how do you reconcile them? Common answers: a late_arrivals staging table plus a scheduled reprocess, or watermark-based re-aggregation if the warehouse supports it.
  • Star vs Snowflake vs One Big Table. When does denormalization win? Cost-aware candidates name the query patterns and the cost of joins on columnar storage.
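
The RANK versus DENSE_RANK distinction from the window-functions item is easiest to see on a tie. The following is a minimal sketch using Python's built-in sqlite3 module; the users table, its columns, and the sample rows are hypothetical, and it assumes the bundled SQLite is version 3.25 or newer (the first release with window functions).

```python
import sqlite3

# Hypothetical toy data: (user_id, signup cohort, total spend).
USERS = [
    ("u1", "2026-01", 300),
    ("u2", "2026-01", 300),  # ties with u1 inside the same cohort
    ("u3", "2026-01", 100),
    ("u4", "2026-02", 50),
]

QUERY = """
SELECT
    user_id,
    cohort,
    RANK()       OVER (PARTITION BY cohort ORDER BY total_spend DESC) AS rnk,
    DENSE_RANK() OVER (PARTITION BY cohort ORDER BY total_spend DESC) AS dense_rnk
FROM users
ORDER BY cohort, total_spend DESC, user_id
"""


def rank_users(rows):
    """Load rows into an in-memory table and rank users within each cohort."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (user_id TEXT, cohort TEXT, total_spend INTEGER)")
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


ranked = rank_users(USERS)
for row in ranked:
    print(row)
# u3 comes out as ('u3', '2026-01', 3, 2): RANK leaves a gap after the tie,
# DENSE_RANK does not.
```

Both functions give the tied users rank 1; they differ only in what the next row receives, which is usually the detail the interviewer is probing for.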

Mentioning dbt tests (unique, not_null, relationships, accepted_values) is the cheapest way to signal modern hygiene. Senior candidates also reference data contracts and producer-owned schemas — both increasingly standard in 2026.
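
It can help to be able to say what those four tests actually check. The sketch below is not dbt itself: it expresses each check as a query that returns the offending rows, with zero rows meaning the check passes, which mirrors the way dbt tests are evaluated. The orders and customers tables, the column names, and the accepted status values are all hypothetical.

```python
import sqlite3

# Each check returns failing rows; an empty result means the check passes.
CHECKS = {
    "unique_order_id": """
        SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1
    """,
    "not_null_customer_id": """
        SELECT order_id FROM orders WHERE customer_id IS NULL
    """,
    "relationships_customer_id": """
        SELECT o.order_id
        FROM orders o
        LEFT JOIN customers c ON c.customer_id = o.customer_id
        WHERE o.customer_id IS NOT NULL AND c.customer_id IS NULL
    """,
    "accepted_values_status": """
        SELECT order_id FROM orders
        WHERE status NOT IN ('placed', 'shipped', 'returned')
    """,
}


def build_demo_db():
    """Create an in-memory database with one deliberate violation per check."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (customer_id INTEGER);
        CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, status TEXT);
        INSERT INTO customers VALUES (1), (2);
        INSERT INTO orders VALUES
            (10, 1, 'placed'),
            (10, 1, 'placed'),     -- duplicate order_id
            (11, NULL, 'shipped'), -- missing customer_id
            (12, 3, 'shipped'),    -- customer 3 does not exist
            (13, 2, 'lost');       -- status outside the accepted set
    """)
    return conn


def run_checks(conn):
    """Return {check_name: number_of_failing_rows}."""
    return {name: len(conn.execute(sql).fetchall()) for name, sql in CHECKS.items()}


results = run_checks(build_demo_db())
for name, failures in results.items():
    print(f"{name}: {'FAIL' if failures else 'PASS'} ({failures} failing rows)")
```

In an interview, describing a test as "a query that should return no rows" tends to land better than listing test names, because it shows where the alert threshold would attach.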

Pipeline and infra questions

The infrastructure round is where data engineer interview questions diverge from analytics engineer ones. Expect at least one question on each of orchestration, distributed compute, streaming, and storage format.

  • Airflow vs Dagster vs Prefect. Describe DAG-based orchestration (Airflow), asset-based orchestration with software-defined assets (Dagster), and the pull-based philosophy of Prefect. Hiring managers want trade-offs (Airflow’s ecosystem versus Dagster’s lineage-first model), not winners.
  • Spark internals. Explain shuffles, narrow vs wide transformations, broadcast joins, skew handling (salting), and when to persist or cache a DataFrame. A common question: “Your Spark job runs in 6 hours instead of 30 minutes — what do you check first?” Walk through the Spark UI, the longest-task stage, partition skew, GC pressure, and shuffle spill.
  • Kafka. Topics, partitions, consumer groups, replication factor, exactly-once semantics via idempotent producers and transactions. Expect a question on what happens when a consumer falls behind.
  • dbt project structure. Sources, staging, intermediate, marts. Macros, snapshots for SCD Type 2, and the difference between incremental and full-refresh. Be able to explain is_incremental() and how to write a backfill-safe incremental model.
  • Lakehouse vs warehouse. Lakehouses (Iceberg, Delta Lake, Hudi on S3 or GCS) keep data in open table formats on object storage, so storage is decoupled from any single compute engine. Warehouses (Snowflake, BigQuery, Redshift) keep data in managed, engine-specific storage tuned for fast queries, even where the vendor scales storage and compute separately. Hybrid setups, such as external Iceberg tables queried from Snowflake, are now common.
  • Kappa vs Lambda. Lambda runs a batch layer and a streaming layer in parallel and reconciles their outputs at serving time. Kappa runs a single streaming pipeline and reprocesses by replaying the log. Kappa is usually cheaper to operate because there is only one codebase; Lambda is easier to backfill from a known-good batch source.
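
Skew handling by salting, mentioned in the Spark item above, can be illustrated without a cluster. The sketch below is plain Python rather than Spark code: zlib.crc32 stands in for the engine's hash partitioner, and the partition count, salt bucket count, and key names are hypothetical.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 8
SALT_BUCKETS = 8  # hypothetical; in practice tuned to the observed skew


def partition_for(key: str) -> int:
    # crc32 is used because it is stable across runs, unlike Python's built-in hash().
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def partition_loads(keys):
    """Count how many rows land in each partition."""
    loads = Counter(partition_for(k) for k in keys)
    return [loads.get(p, 0) for p in range(NUM_PARTITIONS)]


def salted(keys):
    # Append a salt so one hot key becomes SALT_BUCKETS distinct keys.
    # Real jobs often use a random salt; round-robin keeps this demo deterministic.
    return [f"{k}#{i % SALT_BUCKETS}" for i, k in enumerate(keys)]


# 9,000 rows for one hot customer plus 1,000 rows spread over 100 other customers.
keys = ["hot_customer"] * 9000 + [f"cust_{i % 100}" for i in range(1000)]

before = partition_loads(keys)
after = partition_loads(salted(keys))

print("max partition load before salting:", max(before))
print("max partition load after salting: ", max(after))
```

The trade-off worth naming out loud: in a join, the other side has to be replicated once per salt value so that every salted key still finds its match, and aggregations need a second pass to combine the per-salt partial results.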

What hiring managers look for

Beyond raw technical accuracy, the bar in 2026 is reliability and cost. Multiple industry sources — Data Engineering Weekly, the dbt blog, and the Locally Optimistic newsletter — have flagged the same shift: companies are no longer hiring data engineers to build new pipelines. They are hiring to make existing pipelines stop breaking and stop burning cash.

That shift shows up in three predictable ways. First, hiring managers ask about SLAs and SLOs explicitly. “How quickly does your nightly job need to land for the morning standup dashboards to be fresh?” If you cannot name a number, you read as someone who has never been on call. Strong candidates name a freshness SLA (data is no more than 90 minutes stale), a completeness SLA (99.5% of source rows reconcile in the warehouse within 24 hours), and an availability target on the query layer.
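
A freshness and completeness SLA only means something if it can be evaluated mechanically. The following is a rough sketch of such a check, using the 90-minute and 99.5% figures from the paragraph above. The function names and row counts are hypothetical, and completeness is read here as the share of source rows present in the warehouse, which is an assumption.

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(minutes=90)  # "no more than 90 minutes stale"
COMPLETENESS_SLA = 0.995               # 99.5% of source rows accounted for


def is_fresh(last_loaded_at, now):
    """True when the most recent load is within the freshness SLA."""
    return now - last_loaded_at <= FRESHNESS_SLA


def completeness(source_rows, warehouse_rows):
    """Share of source rows present in the warehouse, capped at 1.0."""
    if source_rows == 0:
        return 1.0
    return min(warehouse_rows / source_rows, 1.0)


def evaluate(last_loaded_at, now, source_rows, warehouse_rows):
    ratio = completeness(source_rows, warehouse_rows)
    return {
        "fresh": is_fresh(last_loaded_at, now),
        "complete": ratio >= COMPLETENESS_SLA,
        "completeness_ratio": ratio,
    }


now = datetime(2026, 5, 1, 8, 0, tzinfo=timezone.utc)
report = evaluate(
    last_loaded_at=now - timedelta(minutes=75),  # loaded 75 minutes ago
    now=now,
    source_rows=1_000_000,
    warehouse_rows=996_000,
)
print(report)
```

A check like this would typically run on a schedule and page only when a threshold is breached, which is the difference between an SLA on paper and one that someone is on call for.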

Second, cost shows up in nearly every senior loop. Expect to be asked what your largest pipeline cost per month, how you measured it, and what you did about it. Specific numbers (“we cut Snowflake credits 32% by switching the daily marts model to incremental with a 7-day lookback”) beat generic claims. FinOps awareness — monitoring warehouse credits, BigQuery slot usage, S3 storage tiers, idle compute — is a baseline expectation.
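
The “incremental with a 7-day lookback” pattern quoted above can be sketched in plain Python to show where the saving comes from. This is an illustration under stated assumptions, not dbt or Snowflake code: the data, the function names, and the rows-scanned counter are hypothetical, and the counter models what a warehouse would read if the date filter prunes partitions.

```python
from datetime import date, timedelta

LOOKBACK_DAYS = 7


def full_refresh(source):
    """Rebuild every daily total from scratch; reads all source rows."""
    totals = {}
    for day, amount in source:
        totals[day] = totals.get(day, 0) + amount
    return totals, len(source)


def incremental(source, existing, run_date):
    """Recompute only days inside the lookback window; keep older totals as-is."""
    cutoff = run_date - timedelta(days=LOOKBACK_DAYS)
    recent = [(d, a) for d, a in source if d >= cutoff]
    totals = {d: t for d, t in existing.items() if d < cutoff}
    for day, amount in recent:
        totals[day] = totals.get(day, 0) + amount
    return totals, len(recent)


run_date = date(2026, 5, 1)
# 90 days of history, one row per day.
source = [(run_date - timedelta(days=i), 10) for i in range(90)]
existing, _ = full_refresh(source)

# A late row lands for three days ago, inside the lookback window.
source_with_late = source + [(run_date - timedelta(days=3), 5)]

inc_totals, inc_scanned = incremental(source_with_late, existing, run_date)
full_totals, full_scanned = full_refresh(source_with_late)

print("rows read, full refresh:", full_scanned)
print("rows read, incremental: ", inc_scanned)
print("same result:", inc_totals == full_totals)
```

The limitation to mention alongside the saving: a row that arrives later than the lookback window is silently missed, so the window length is a trade-off between cost and how late data is allowed to be.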

Third, data contracts and observability are standard talking points. Producer-owned schemas with explicit consumer SLAs, lineage tools (OpenLineage, dbt’s manifest, Datafold, Monte Carlo), and alerting strategies (Slack, PagerDuty) all come up. Saying “Monte Carlo for anomaly detection on the top 20 tables and dbt tests on everything else” signals tiered ownership. Senior loops grade hard on this — a senior who cannot name a cost number or an SLA usually gets leveled down to mid before the offer goes out.

Questions to ask them

The questions you ask matter as much as the ones you answer. Strong data engineer questions probe operational reality and surface red flags before the offer hits your inbox.

  • “What does the on-call rotation look like, and how many pages did the team get last week?” Zero is suspicious (nothing is monitored); 20 is suspicious (the platform is on fire). Five to ten is healthy.
  • “Who owns data quality when a pipeline fails at 2am?” Listen for whether ownership is clear (a named team, a rotation) or diffuse (“usually the person who wrote it”). The latter signals inherited broken DAGs.
  • “What’s the current backlog of broken or flaky pipelines?” Every team has debt; the question is whether they can quantify it. A specific number (“14 DAGs marked deprecated, 3 we still rerun manually”) signals discipline.
  • “How does the team measure pipeline reliability?” Strong answers reference SLOs, error budgets, freshness and completeness metrics, or an incident-per-quarter count.
  • “Do you have a Snowflake or BigQuery budget?” Cost-aware teams have a budget and a monthly review; cost-unaware teams discover overruns from the CFO.
  • “What’s the most painful migration on the roadmap?” Surfaces what you would actually work on in your first six months — often a warehouse migration, a Kafka upgrade, or a move to an open table format.

Save culture and perks questions for the recruiter or the offer call.

Common mistakes

Three failure modes show up again and again in data engineer interview debriefs.

The first is claiming distributed-systems experience without depth. Listing Spark on your resume and then being unable to explain a shuffle, a broadcast join, or what happens when a stage spills to disk is the fastest way to lose a senior loop. If you have only used Spark through dbt or SQL-on-Spark, say so explicitly — interviewers respect the calibration.

The second is treating data quality as someone else’s job. Candidates who describe a pipeline as “the model worked, the analyst caught the bug” signal builder mindset, not operator mindset. Strong candidates volunteer the tests they wrote, the contract they enforced, and the alert that fired before the analyst noticed. Hiring managers grade reliability ownership as a top-three signal in 2026.

The third is not knowing what your pipeline cost. A widely quoted r/dataengineering thread from early 2026 on senior interviews hammers this point: candidates who cannot quote a warehouse credit number, a partition strategy, or a storage tier choice get leveled down. Even if you do not own the bill, ask before the interview. Naming a real number (“the marts layer cost about $4K a month on Snowflake before we added a clustering key”) beats any architecture diagram.

Smaller mistakes: over-engineering streaming at a batch shop, naming exotic tools the team does not use, confusing Type 1 and Type 2 SCDs under pressure. Prep the fundamentals cold, rehearse one system design out loud per day, and walk in with one cost number and one SLA you can quote from memory. That is the bar in 2026.

Frequently asked questions

How long is a typical data engineer interview loop in 2026?

Most loops run four stages across 2 to 4 weeks: a recruiter screen, a hiring-manager call, a technical screen (almost always SQL plus a coding or modeling question), and a virtual onsite of 3 to 5 rounds. Senior loops add a system design round focused on pipeline architecture and an on-call or reliability conversation. Short contract roles (3–12 months) often compress this to two technical rounds plus a reference check.

What SQL topics come up most often in data engineer interviews?

Window functions (ROW_NUMBER, LAG, LEAD, RANK, SUM OVER), slowly-changing dimensions (Type 1 and Type 2), partition pruning, deduplication patterns, and incremental merge logic. Expect at least one question on how to write an idempotent upsert and one on how to backfill a table without breaking downstream consumers.
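
For the idempotent upsert question, the usual shape is a deduplication step followed by a keyed merge. The sketch below runs on Python's built-in sqlite3 module. SQLite has no MERGE statement, so INSERT ... ON CONFLICT DO UPDATE is swapped in as a stand-in, and the staging and target tables are hypothetical. It assumes the bundled SQLite is version 3.25 or newer.

```python
import sqlite3

# Deduplicate staging to the latest row per id, then upsert on the primary key.
UPSERT = """
WITH ranked AS (
    SELECT id, name, updated_at,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
    FROM staging
)
INSERT INTO target (id, name, updated_at)
SELECT id, name, updated_at FROM ranked WHERE rn = 1
ON CONFLICT(id) DO UPDATE SET
    name = excluded.name,
    updated_at = excluded.updated_at
WHERE excluded.updated_at >= target.updated_at
"""


def make_conn():
    """Create staging (with a duplicate id) and an empty keyed target table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE staging (id INTEGER, name TEXT, updated_at TEXT);
        CREATE TABLE target  (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT);
        INSERT INTO staging VALUES
            (1, 'old name', '2026-05-01'),
            (1, 'new name', '2026-05-02'),
            (2, 'other',    '2026-05-01');
    """)
    return conn


def run_job(conn, times):
    """Run the same upsert repeatedly and return the final target rows."""
    for _ in range(times):
        conn.execute(UPSERT)
        conn.commit()
    return conn.execute("SELECT id, name FROM target ORDER BY id").fetchall()


final_rows = run_job(make_conn(), times=10)
print(final_rows)  # two rows, however many times the job ran
```

The two properties to point out are the deterministic key, which keeps reruns from creating rows, and the updated_at guard, which keeps an older replay from overwriting newer data.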

Do I need to know Spark for every data engineer interview?

If the role touches terabyte-scale data, yes. For dbt-and-warehouse shops where most work runs inside Snowflake or BigQuery, deep PySpark internals are nice-to-have rather than mandatory. Read the job description: any mention of Databricks, EMR, or Glue means you should be able to explain shuffles, skew, broadcast joins, and the difference between narrow and wide transformations.

What's the difference between a data engineer and analytics engineer interview?

Analytics engineer loops lean heavier on dbt, modeling, stakeholder communication, and BI tools. Data engineer loops add infrastructure: orchestration (Airflow, Dagster, Prefect), distributed compute (Spark, Flink), streaming (Kafka, Kinesis), and platform concerns like lineage, observability, and cost. Many companies still conflate the titles, so always confirm scope in the recruiter screen.

How should I prep for the system design round?

Pick three reference architectures and rehearse them out loud: a batch ELT pipeline (CDC source, S3/object landing, dbt on warehouse, BI consumer), a streaming pipeline (Kafka, stream processor, lakehouse sink), and a lakehouse design using Iceberg or Delta Lake. Practice naming SLAs, partitioning strategy, schema evolution, late-arriving data, and one cost number per layer.

Are take-home assignments still common?

Yes, especially at mid-size companies and remote-first startups. Expect a 4 to 8 hour task: build a small pipeline that ingests raw JSON or CSV, lands it in a warehouse or DuckDB, and writes 2–3 dbt models with tests. Spend a third of your time on the README — most candidates over-invest in code and under-invest in explaining trade-offs and what they would harden in production.

What gets candidates rejected most often?

Three failure modes dominate: claiming Spark experience without being able to explain shuffles or partitioning, treating data quality as someone else's problem, and not knowing what a pipeline costs to run. Strong candidates name a cost number, an SLA, and at least one incident they owned.

How much streaming knowledge is expected at the mid level?

Junior and mid-level roles mostly focus on batch and micro-batch. Streaming is 'good to know' for mid-level and expected at senior. If the team uses Kafka or Kinesis in production, expect questions on exactly-once semantics, watermarks, consumer groups, and the Kappa vs Lambda architecture debate.
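
Watermarks are easier to explain with a concrete run. The following is a toy sketch in plain Python, not Flink or Spark Structured Streaming code: the window size, the allowed lateness, and the event list are hypothetical, and the watermark is simply the largest event time seen so far minus the allowed lateness.

```python
from collections import defaultdict

WINDOW_SECONDS = 60
ALLOWED_LATENESS = 120  # hypothetical: how far behind the newest event we still accept


def window_start(ts):
    """Start of the tumbling window that contains event time ts."""
    return ts - (ts % WINDOW_SECONDS)


def aggregate(events):
    """events: iterable of (event_time_seconds, value) in arrival order."""
    counts = defaultdict(int)
    late = []
    max_event_time = 0
    for ts, value in events:
        max_event_time = max(max_event_time, ts)
        watermark = max_event_time - ALLOWED_LATENESS
        if window_start(ts) + WINDOW_SECONDS <= watermark:
            # The window closed before this event arrived: send it to a side output.
            late.append((ts, value))
        else:
            counts[window_start(ts)] += value
    return dict(counts), late


# The event at t=30 arrives after t=200 has pushed the watermark to 80,
# so its window [0, 60) is already closed.
events = [(10, 1), (70, 1), (65, 1), (200, 1), (30, 1), (190, 1)]
counts, late = aggregate(events)

print(counts)  # {0: 1, 60: 2, 180: 2}
print(late)    # [(30, 1)]
```

Out-of-order events that arrive before their window closes (t=65 after t=70, t=190 after t=200) are still counted; only the event behind the watermark is diverted, typically to a late-arrivals table for scheduled reprocessing.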

Should I mention data contracts and FinOps in my interview?

If the role is at a company with a real platform team, absolutely. Data contracts (producer-owned schemas with explicit SLAs) and FinOps (monitoring Snowflake credits, BigQuery slot usage, S3 storage tiers) are now standard talking points for senior data engineer interviews in 2026. Saying 'we cut our warehouse bill by 30% by partitioning on event_date' lands far better than abstract claims.

Do I need to know Iceberg or Delta Lake to pass?

Not always, but the lakehouse pattern shows up in roughly half of senior loops. Be able to explain why open table formats matter (ACID on object storage, time travel, schema evolution, multi-engine reads) and the practical difference between Iceberg, Delta Lake, and Hudi. A surface-level answer is fine — depth is only expected if the role explicitly lists one of them.

How important is Python versus Scala or Java?

Python plus SQL covers roughly 90% of US data engineer postings in 2026. Scala still shows up at companies running heavy Spark workloads (and pays a premium), and Java appears at Kafka-heavy shops. If you only know Python, prep one or two Spark questions in PySpark and skip the Scala-flavored ones — most interviewers accept either.

What questions should I ask the interviewer?

Ask about on-call rotation, who owns data quality when a pipeline fails at 2am, what the current backlog of broken DAGs looks like, and how the team measures pipeline reliability. These questions signal that you think like an operator, not just a builder, and they surface red flags before you accept the offer.