Updated 2026-05-21

Data Scientist Interview Questions — Complete 2026 Guide

Data scientist interviews in 2026 are no longer just about whether you can train a model. Hiring managers at Meta, DoorDash, Microsoft, and the next wave of Series-B startups are screening for a tighter bundle: SQL fluency under time pressure, statistical reasoning that connects to product decisions, and the judgment to know when a 0.3% lift is worth shipping. This guide walks through the loop end-to-end, the questions that come up at every stage, and the patterns that separate offers from polite rejection emails.

The data scientist interview funnel

Most data scientist loops in 2026 follow the same shape, with small variations by company size. The standard funnel is: recruiter screen (20–30 minutes, motivation and salary calibration), hiring-manager call (45 minutes, project deep-dive plus light technical), technical screen (60 minutes, almost always SQL plus one statistics or Python question), and a virtual onsite of 3 to 5 rounds. DoorDash and TikTok have publicly documented 7-step funnels; Meta and Microsoft typically run a 4–5 round onsite.

The onsite itself splits into four predictable buckets. The SQL and analytics round tests window functions, joins, and edge cases against a fictional product schema. The statistics and probability round mixes A/B testing design with classic probability puzzles and hypothesis-testing knowledge. The ML and modeling round asks you to choose models, defend feature choices, and walk through bias-variance tradeoffs. The case study round is the make-or-break — interviewers frame a loose product question, then watch you structure it. A behavioral round closes the loop, and at senior levels a system design or ML system design round gets added.

Loop length varies. Most candidates report 2 to 4 weeks between recruiter screen and offer; senior loops with reference calls can stretch to 6 weeks. Take-home assignments are still common at startups and mid-size companies, usually given between the technical screen and onsite.

SQL and analytics questions

SQL is the gateway round. A 2024 Forrester report still cited by KDnuggets in 2025 found SQL remains the single most widely used language for production analytics and ML data pipelines — knowing it is the baseline, not the differentiator. In 2026, interviewers assume you can write a SELECT with JOIN and GROUP BY in your sleep. The bar is higher: window functions, CTEs, and query optimization are where most candidates get exposed.

Expect questions like these:

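  • Top N per group. “Return the second-highest salary per department.” Almost every loop tests this. The canonical solution uses ROW_NUMBER() or DENSE_RANK() inside a CTE, then filters on rank = 2. Mention what changes if there are ties: DENSE_RANK gives tied salaries the same rank, so rank 2 is always the second-highest distinct salary, while ROW_NUMBER numbers tied rows in arbitrary order and can return a row that shares the top salary.
  • Year-over-year growth. Compute YoY revenue by month using LAG() with PARTITION BY month, ORDER BY year. Watch for gaps where a month is missing: LAG() then returns the value from two or more years back, so check that the previous row really is the prior year. That’s the edge case the interviewer is testing.
  • Sessionization. Given user events with timestamps, define a “session” as a run of events with no gap longer than 30 minutes. The trick is a window function that compares each event’s timestamp to LAG() and flags session boundaries; a running SUM() over those flags then gives each event its session number.
  • Duplicate detection. Spot exact duplicates with ROW_NUMBER() partitioned by the columns that define a duplicate, then near-duplicates by hashing a normalized tuple of columns (for example, trimmed and lowercased text).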
  • Cohort retention. Build a cohort table where rows are signup-month and columns are months-since-signup. Self-join the events table on user_id.
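
The tie behavior in the top-N bullet is easy to demonstrate. The sketch below is a minimal illustration, not a reference solution: the employees table, its columns, and its rows are invented for the example, and it runs through Python's built-in sqlite3 module, which assumes the bundled SQLite is version 3.25 or later (the first with window functions).

```python
import sqlite3

# Hypothetical schema and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ana", "eng", 200),
        ("Bo", "eng", 200),   # tie for the top salary
        ("Cy", "eng", 150),
        ("Di", "ops", 120),
        ("Ed", "ops", 100),
        ("Flo", "ops", 100),  # tie for the second-highest salary
    ],
)

QUERY = """
WITH ranked AS (
    SELECT name, department, salary,
           {fn}() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk
    FROM employees
)
SELECT department, name, salary
FROM ranked
WHERE rnk = 2
ORDER BY department, name
"""


def second_highest(rank_fn):
    """Run the rank = 2 filter with the given ranking function name."""
    return conn.execute(QUERY.format(fn=rank_fn)).fetchall()


dense = second_highest("DENSE_RANK")   # second-highest distinct salary, ties kept
rownum = second_highest("ROW_NUMBER")  # exactly one row per department

print(dense)
print(rownum)
```

With this data, DENSE_RANK returns Cy for eng and both Ed and Flo for ops. ROW_NUMBER returns one row per department, and for eng that row is whichever of Ana or Bo happened to be numbered second, at the top salary of 200.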

Talk through the edge cases out loud: NULLs in JOIN columns, time zones in timestamps, ties in rankings. A candidate who proactively handles NULLs scores notably higher than one who gets the right answer silently.
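
As a concrete case of the NULL edge case, here is a small sketch of how NULL join keys silently drop rows. The orders and coupons tables are hypothetical, and the queries run through Python's sqlite3 module. In SQL, NULL = NULL does not evaluate to true, so an inner join never matches two NULL keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables, invented for illustration.
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, coupon_code TEXT);
CREATE TABLE coupons (coupon_code TEXT, discount INTEGER);
INSERT INTO orders VALUES (1, 'SAVE10'), (2, NULL), (3, 'SAVE20');
INSERT INTO coupons VALUES ('SAVE10', 10), (NULL, 0);
""")

# Inner join: the NULL-coded order and the unmatched code both disappear.
inner_rows = conn.execute("""
    SELECT o.order_id, c.discount
    FROM orders o
    JOIN coupons c ON o.coupon_code = c.coupon_code
    ORDER BY o.order_id
""").fetchall()

# Left join plus COALESCE keeps every order and makes the default explicit.
left_rows = conn.execute("""
    SELECT o.order_id, COALESCE(c.discount, 0) AS discount
    FROM orders o
    LEFT JOIN coupons c ON o.coupon_code = c.coupon_code
    ORDER BY o.order_id
""").fetchall()

print(inner_rows)
print(left_rows)
```

The inner join returns only order 1. The left join returns all three orders, which is usually what a revenue or order-count query needs.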

Statistics and probability questions

The statistics round in a data scientist interview tests two things: whether you can design an experiment, and whether you understand what your test is actually telling you. A/B testing dominates — KDnuggets’ 2025 interview guide notes that product-heavy interviews at Airbnb, Meta, and Google all include at least one experimentation round.

Standard A/B testing questions:

  • Sample size and power. Given a baseline conversion rate of 5% and a minimum detectable effect of 0.5 percentage points, how many users do you need per arm? Walk through alpha (typically 0.05), power (typically 0.8), and the formula. You don’t need to compute it exactly — naming the inputs is what matters.
  • Multiple comparisons. “We’re testing five variants. What changes?” Mention Bonferroni correction (divide alpha by the number of tests) or Benjamini-Hochberg for false discovery rate when you have many tests.
  • Novelty and primacy effects. Why might week-one results overstate the true effect? Be ready to talk about how long to run before reading out.
  • Sampling bias. Given a survey with a 30% response rate, how would you check for non-response bias? Discuss stratified sampling and post-stratification weights.
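
For the sample-size question, it can help to have seen the arithmetic once. The sketch below uses the common normal-approximation formula for comparing two proportions with a two-sided test. That choice of formula is an assumption (interviewers and tools may use a slightly different variant), and the function name is made up for the example.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(p_baseline, mde_abs, alpha=0.05, power=0.8):
    """Approximate users per arm for a two-sided two-proportion z-test.

    p_baseline: control conversion rate (e.g. 0.05)
    mde_abs:    minimum detectable effect in absolute terms (e.g. 0.005)
    """
    p_treat = p_baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.8
    variance_sum = p_baseline * (1 - p_baseline) + p_treat * (1 - p_treat)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / mde_abs ** 2)


n = sample_size_per_arm(p_baseline=0.05, mde_abs=0.005)
print(n)
```

For the 5% baseline and 0.5-point effect in the prompt, this comes out to roughly 31,000 users per arm. The useful talking point is the scaling: halving the detectable effect roughly quadruples the required sample.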

Pure probability puzzles still appear — expected value, conditional probability, Bayes’ theorem on test sensitivity and specificity. Practice the classic “rare disease” Bayesian update; it shows up at almost every FAANG-tier company.
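
The “rare disease” update is a few lines of arithmetic. The numbers below (1% prevalence, 99% sensitivity, 95% specificity) are assumed for illustration and are not taken from any specific interview.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    true_positive = prevalence * sensitivity
    false_positive = (1 - prevalence) * (1 - specificity)
    return true_positive / (true_positive + false_positive)


# Assumed inputs for illustration.
p = posterior_given_positive(prevalence=0.01, sensitivity=0.99, specificity=0.95)
print(round(p, 3))
```

Even with a 99%-sensitive test, a positive result here implies only about a one-in-six chance of disease, because false positives from the healthy 99% of the population outnumber true positives.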

ML and modeling questions

ML rounds test judgment more than implementation. You won’t usually have to derive backpropagation on a whiteboard, but you’ll be asked to defend model choices and reason about failure modes.

Common prompts:

  • Bias-variance tradeoff in plain language. Why does a deeper tree overfit? How does regularization help? Connect it to concrete techniques: L1 versus L2, dropout in neural nets, max_depth in XGBoost.
  • Feature engineering. “How would you encode a high-cardinality categorical with 10,000 levels?” Discuss target encoding, hashing trick, and embeddings. Name when each fails.
  • Model evaluation. When does accuracy mislead? Walk through precision, recall, F1, AUC-ROC, and PR-AUC for imbalanced classes. Fraud and churn are the canonical examples.
  • Cross-validation. Why is K-fold inappropriate for time-series? Shuffled folds let the model train on data from after the period it is tested on. Explain forward-chaining or expanding-window validation.
  • Picking a model. “I have 50,000 rows of structured tabular data and need explainability. What model?” The expected answer is gradient-boosted trees (XGBoost, LightGBM, CatBoost) with SHAP for explanations — not a neural network.
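
To make the “when does accuracy mislead” prompt concrete, here is a small sketch with an invented fraud dataset: 1,000 transactions, 10 of them fraudulent. Both sets of predictions are hypothetical. The metrics are written out by hand so the definitions are visible.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Define precision/recall/F1 as 0.0 when their denominators are zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented data: 10 fraud cases followed by 990 legitimate transactions.
y_true = [1] * 10 + [0] * 990

# Baseline that never flags fraud.
always_negative = [0] * 1000
# Hypothetical model: catches 6 of 10 fraud cases, raises 4 false alarms.
model_pred = [1] * 6 + [0] * 4 + [1] * 4 + [0] * 986

baseline = classification_metrics(y_true, always_negative)
model = classification_metrics(y_true, model_pred)
print(baseline)
print(model)
```

The do-nothing baseline scores 99% accuracy with zero recall, while the model that actually finds fraud is only 0.2 points more accurate. Precision and recall are what separate the two.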

A real 2026 question pulled from KDnuggets: “Implement a scalable ML pipeline for continuous model updates.” This is increasingly a system-design-flavored question, especially in senior loops where MLOps and monitoring come up.

Case study and product sense questions

The case study round is where data science loops are won and lost. Prompts are intentionally vague: “Engagement on our app is flat — how would you investigate?” or “We’re launching a new feature next quarter. How would you measure success?”

A repeatable structure works better than cleverness:

  1. Clarify scope. What product, what user segment, what time window? Restating the question forces alignment.
  2. Define success metrics. Pick one primary metric and 2–3 guardrails. For engagement, the primary metric might be DAU/MAU, read alongside session length and 7-day retention; the guardrails could be crash rate and notification opt-outs.
  3. List hypotheses. Brainstorm 4–6 reasons the metric moved. Order them by likelihood and ease of investigation.
  4. Specify the analysis. For each top hypothesis, name the SQL or experiment that would confirm or rule it out. Mention sample sizes and which segments you’d cut by.
  5. Recommend a next step. End with “based on what I find, I’d either run X experiment or escalate Y to product.”

Interviewers reward candidates who voluntarily pick a metric and defend it. The trap is jumping to “I’d build a model” — that’s almost never the right first move in a product case.

What hiring managers look for

The signal that gets candidates hired isn’t technical depth — it’s the ability to translate a business problem into a measurable analysis and back into a decision. Hiring managers consistently rank “product sense” and “communication” above raw modeling skill in post-loop debriefs.

Three behaviors stand out:

  • Picking the metric before the method. Strong candidates spend the first three minutes of any question on the metric. Weak candidates start with the model.
  • Naming tradeoffs out loud. “I’d use logistic regression here for explainability, even though XGBoost would likely give me 2–3 points more AUC, because the legal team needs to audit feature weights.” That sentence wins offers.
  • Quantifying past impact. Behavioral answers should land on a number: “lifted activation 4.2%,” “cut model training time from 6 hours to 40 minutes,” “saved $1.8M in annual fraud losses.” Generic claims of “improved performance” get scored as missing evidence.

Practice describing two or three past projects in 90 seconds each, with one metric in the result line.

Common mistakes

Three failure patterns show up in interview debriefs over and over:

  • Skipping the metric. Candidates dive into modeling before defining what success looks like. Interviewers note this in feedback as “weak product sense” even when the technical answer is correct.
  • Treating statistical significance as business significance. A p-value of 0.03 on a 0.1% lift is not a ship recommendation. Always pair statistical results with effect size and confidence intervals, and connect them to revenue or user impact.
  • Refusing to commit. When asked “which model would you use?” weak candidates list five options without picking one. Strong candidates pick one, defend it in two sentences, and acknowledge the conditions under which they’d switch.
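
The gap between statistical and business significance is easier to argue with numbers in hand. The sketch below runs a two-proportion z-test on invented counts (500,000 users per arm, conversion of 5.0% versus 5.1%) and reports the absolute lift, the p-value, and a 95% confidence interval. The function name and the data are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided z-test for a difference in conversion rates, with a CI."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a

    # Pooled standard error for the hypothesis test.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the confidence interval on the lift.
    se_unpooled = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)
    return diff, p_value, ci


# Invented counts: control 25,000 / 500,000, treatment 25,500 / 500,000.
diff, p_value, ci = two_proportion_test(25_000, 500_000, 25_500, 500_000)
print(f"lift = {diff:.4%}, p = {p_value:.3f}, 95% CI = ({ci[0]:.4%}, {ci[1]:.4%})")
```

The test clears the 0.05 threshold, yet the interval runs from a lift of almost nothing to just under 0.2 percentage points. Whether that is worth shipping depends on the revenue per conversion and the cost of maintaining the change, which is the part of the answer interviewers are listening for.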

Two smaller-but-fatal mistakes: forgetting to handle NULLs in SQL answers, and not asking clarifying questions before a case study. Both signal lack of rigor more than lack of knowledge. The fix is the same in both cases — slow down for the first 60 seconds, restate the problem, and only then start solving.

Frequently asked questions

How long is a typical data scientist interview loop in 2026?

Most loops run 4 to 7 stages across 2 to 4 weeks: a recruiter screen, a hiring-manager call, a technical screen (usually SQL plus a stats or Python question), and a virtual onsite with 3 to 5 interviews covering SQL, experimentation, modeling, case study, and behavioral. DoorDash and TikTok both publish 7-step funnels; Meta and Microsoft typically land at 4–5 onsite rounds.

What SQL topics show up most often in data scientist interviews?

Window functions (ROW_NUMBER, LAG, LEAD, RANK), CTEs and recursive CTEs, self-joins for cohort analysis, and date-bucketing for retention or year-over-year growth. Duplicate detection and 'top N per group' problems are the most common interview patterns across FAANG and Series-B startups alike.
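
As one worked instance of these patterns, the sketch below computes year-over-year growth with LAG() and guards against a missing month. The monthly_revenue table and its rows are invented for the example; it runs through Python's sqlite3 module and assumes SQLite 3.25 or later for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table, invented for illustration.
conn.execute(
    "CREATE TABLE monthly_revenue (rev_year INTEGER, rev_month INTEGER, revenue REAL)"
)
conn.executemany(
    "INSERT INTO monthly_revenue VALUES (?, ?, ?)",
    [
        (2023, 1, 100.0), (2024, 1, 120.0), (2025, 1, 150.0),
        (2023, 2, 200.0), (2025, 2, 260.0),  # February 2024 is missing
    ],
)

rows = conn.execute("""
    WITH lagged AS (
        SELECT rev_year, rev_month, revenue,
               LAG(rev_year) OVER (PARTITION BY rev_month ORDER BY rev_year) AS prev_year,
               LAG(revenue)  OVER (PARTITION BY rev_month ORDER BY rev_year) AS prev_revenue
        FROM monthly_revenue
    )
    SELECT rev_year, rev_month,
           -- Only report growth when the previous row really is the prior year.
           CASE WHEN prev_year = rev_year - 1
                THEN ROUND((revenue - prev_revenue) / prev_revenue, 4)
           END AS yoy_growth
    FROM lagged
    ORDER BY rev_month, rev_year
""").fetchall()

for row in rows:
    print(row)
```

Without the prev_year check, February 2025 would be compared against February 2023 and reported as one year of growth. With it, the row comes back as NULL, which is the honest answer when the prior year is missing.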

How should I approach an A/B testing question?

Walk through the same framework every time: define the hypothesis and primary metric, choose the unit of randomization, estimate the sample size needed for the minimum detectable effect, name your significance level (usually alpha = 0.05) and power (usually 0.8), and call out guardrail metrics. Mentioning Bonferroni or Benjamini-Hochberg correction for multiple comparisons is a strong-signal move.
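
If the interviewer pushes on multiple comparisons, it helps to know how the two corrections differ in practice. The sketch below implements both on five made-up p-values; the function names and the data are assumptions for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a hypothesis only if its p-value is <= alpha / number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p

    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank

    # Reject every hypothesis ranked at or below k.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Made-up p-values for five variants.
p_values = [0.003, 0.012, 0.021, 0.045, 0.300]

print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

On these inputs Bonferroni keeps only the first result, while Benjamini-Hochberg keeps the first three. That is the tradeoff to name: Bonferroni controls the chance of any false positive, and Benjamini-Hochberg accepts a controlled share of false discoveries in exchange for more power.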

Do I need to know deep learning to pass a data scientist interview?

For most product or analytics DS roles, no — strong fundamentals in regression, tree-based models, and evaluation metrics will carry you. For research, recommendation, or applied-ML positions, expect at least one round on embeddings, transformers, and the basics of how LLMs are fine-tuned and served.

What's the difference between a data scientist and data analyst interview?

Analyst interviews lean heavier on SQL, dashboarding, and stakeholder communication. Data scientist interviews add hypothesis testing, ML modeling, and a case study round that asks you to design metrics or diagnose a metric drop. Both share product sense, but DS interviews go deeper on experimentation.

How do I prepare for the case study round?

Practice on loosely framed prompts like 'DAU is flat — how would you investigate?' Build a habit of clarifying scope, listing 4–6 hypotheses, prioritizing by likelihood and impact, then describing the SQL or experiment you would run. Interviewers reward structure and a clearly defended metric over clever answers.

What statistics concepts come up most often?

Central limit theorem, p-values and the difference between Type I and Type II error, confidence intervals, sampling bias, simple Bayesian reasoning, and the assumptions behind common tests (t-test, chi-square, Mann-Whitney). Expect at least one question on why a p-value is not the probability the null hypothesis is true.

How should I talk about ML projects in the behavioral round?

Pick a project with a measurable business outcome, not just a model metric. Lead with the decision the model influenced, the lift in the primary metric (revenue, retention, fraud caught), and one tradeoff you made — for example, choosing a simpler model for explainability or accepting more false positives to reduce missed fraud.

Are take-home assignments still common in 2026?

Yes, especially at startups and mid-size companies. Expect a 4 to 8 hour notebook task: clean a messy dataset, build a baseline model, and write a short memo. Spend at least a third of your time on the writeup — most candidates over-invest in model accuracy and under-invest in framing and recommendations.

What gets candidates rejected most often?

Three failure modes dominate: jumping to a model before defining the metric, treating statistical significance as business significance, and not narrating tradeoffs. Strong candidates verbalize what they would do differently with more data or more time — silence reads as overconfidence.

How important is Python versus R?

Python is now standard across roughly 95% of US data scientist roles. R still shows up in healthcare, biostatistics, and some research teams. If you only know R, plan to spend two weekends getting comfortable with pandas, scikit-learn, and basic matplotlib before applying broadly.

Should I mention LLMs and generative AI in my interview?

If the role touches them, yes — but only with substance. Saying 'we used GPT-4 for classification' lands flat; describing how you evaluated it against a fine-tuned baseline, measured hallucination rates, and chose a routing strategy lands well. Senior DS loops in 2026 increasingly include one LLM-flavored question.