How many rounds are in the Amazon Data Scientist interview loop?
The full loop is typically 5–6 virtual or onsite interviews following a recruiter screen and 1–2 technical phone screens. Each 45–60 minute session is owned by a different interviewer who evaluates both a technical domain and 2–3 assigned Leadership Principles.
What technical topics does Amazon test data scientists on?
Expect SQL (joins, window functions, aggregations), statistics and probability (hypothesis testing, A/B test design, distributions), machine learning (feature engineering, model selection, evaluation metrics), and occasionally Python or pandas for data manipulation.
How important are the Leadership Principles for a data scientist role at Amazon?
Extremely important. Amazon weights behavioral evidence on par with technical skill. Every interviewer in your loop is assigned specific Leadership Principles to probe, and the Bar Raiser can reject you even if you excelled in every technical round.
What is the Bar Raiser and how do I handle that round?
The Bar Raiser is a trained interviewer from outside the hiring team with veto authority over the decision. The round is primarily behavioral. Treat it like any other Leadership Principles interview — specific STAR stories, quantified results, first-person framing. Do not try to guess who the Bar Raiser is during your loop.
What level does Amazon typically hire data scientists at?
Most new-grad or early-career hires land at L4. Candidates with 3–5 years of industry experience typically interview for L5. L6 (Senior DS) requires demonstrated cross-functional impact and technical depth. Each level has a distinct hiring bar and compensation band.
How should I structure behavioral answers for Amazon?
Use STAR: Situation, Task, Action, Result. Always say 'I' not 'we' so the interviewer can credit you specifically. End with a quantified outcome — percentage lift, dollar impact, time saved — and optionally add what you learned.
Does Amazon ask A/B testing questions for data scientists?
Yes, experiment design is one of the heaviest topics. You may be asked to design a test from scratch, diagnose a sample ratio mismatch, or explain how you would handle novelty effect or network interference in a marketplace setting.
What SQL difficulty should I expect at Amazon?
Questions lean intermediate-to-hard: multi-table joins, window functions (RANK, LAG, LEAD), running totals, and filtering on aggregated results. You should be comfortable writing readable, correct SQL under time pressure without an IDE.

The Amazon data scientist interview is longer and more structured than most. From the moment a recruiter reaches out to the day you receive a decision, the process typically spans four to six weeks. Knowing the shape of each stage — and why Amazon runs it the way it does — is the single biggest advantage you can give yourself before prep begins.

The Amazon interview loop: stage by stage

Recruiter screen (30 minutes). A phone call to confirm your background, discuss the team, and explain the process. You may get a surface-level behavioral question (“Tell me about a time you worked with ambiguous data”). The recruiter is also gauging level fit — whether you’re being considered for L4, L5, or L6.

Technical phone screen (45–60 minutes, sometimes two). One or two virtual sessions with a data scientist or engineer on the team. Expect a live SQL or Python problem plus at least one stats or A/B testing conceptual question. Some teams add a second phone screen focused on machine learning. Performance here determines whether you advance to the loop.

The onsite loop (5–6 sessions, virtual or in-person). This is the main event. Each session runs 45–60 minutes and is owned by a single interviewer who has been assigned a technical domain and 2–3 Leadership Principles to evaluate. A typical loop structure looks like this:

  • Round 1 – SQL and data manipulation. Multi-step queries, window functions, aggregations. Expect to code live in a shared editor.
  • Round 2 – Statistics and experiment design. Hypothesis testing, power analysis, A/B test architecture, identifying threats to validity.
  • Round 3 – Machine learning. Model selection, feature engineering, evaluation metrics, bias-variance tradeoff, how you’d deploy and monitor a model in production.
  • Round 4 – Product and business case. A metric deep-dive or case study. You are given a business scenario and must identify the right measurement approach, propose analysis, and interpret a (sometimes intentionally misleading) result.
  • Round 5 – Behavioral / Leadership Principles. Pure behavioral with 2–3 principle questions. The Bar Raiser often appears in this slot, though you will not be told who they are.
  • Round 6 (optional) – Hiring manager. Some teams add a session with the hiring manager that blends behavioral and role-specific technical questions.

Debrief. Each interviewer submits written feedback independently, before seeing anyone else’s assessment. The Bar Raiser then facilitates a debrief in which the group reaches consensus, and holds effective veto power over a “hire” recommendation.

What Amazon uniquely evaluates

Most tech companies say culture fit matters. Amazon operationalizes it through 16 Leadership Principles, each of which is assigned to one or more interviewers in your loop. Three of them come up with particular frequency for data science candidates:

Dive Deep. Amazon’s data culture is obsessive about ground truth. Interviewers will probe whether you’ve personally interrogated raw data rather than relying on dashboards, whether you’ve caught a flaw in an upstream metric definition, or whether you’ve pushed back on a business stakeholder’s interpretation of a p-value.

Invent and Simplify. Did you build a smarter experiment framework, replace a cumbersome modeling pipeline, or find a cheaper way to get the same analytical answer? Amazon wants engineers and scientists who reduce complexity, not add to it.

Are Right, A Lot. This one trips people up. It’s not about being confident — it’s about having good judgment under uncertainty. Interviewers listen for how you reason when data is incomplete, how you weigh conflicting signals, and whether you update your position when evidence changes.

The Bar Raiser’s explicit job is to assess whether you’d raise the overall quality of Amazon’s data science organization, not just whether you can do the specific role. They will push harder on your reasoning than on your answers.

SQL round: what you’ll actually see

Amazon’s SQL questions are not trick puzzles. They test whether you can write readable, correct, efficient queries under time pressure. Common patterns:

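Window functions. “Given a table of daily user sessions with timestamps, write a query that returns each user’s second-most-recent session.” You need ROW_NUMBER() or DENSE_RANK() partitioned by user and ordered by session timestamp descending, then filtered on rank = 2. ROW_NUMBER() returns exactly one row per user; DENSE_RANK() returns every session tied at the second-latest timestamp, so say which behavior you intend.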
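A minimal sketch of that pattern, run through Python's built-in sqlite3 module so it can be executed without a database server. The table name sessions, its columns, and the sample rows are made up for illustration, and the window function needs SQLite 3.25 or newer (bundled with most recent Python builds).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sessions (user_id INTEGER, started_at TEXT);
    INSERT INTO sessions VALUES
        (1, '2024-01-01 09:00'), (1, '2024-01-03 10:00'), (1, '2024-01-05 08:30'),
        (2, '2024-01-02 12:00'), (2, '2024-01-04 15:00'),
        (3, '2024-01-06 11:00');  -- user 3 has only one session
""")

SECOND_MOST_RECENT = """
    WITH ranked AS (
        SELECT user_id,
               started_at,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id
                   ORDER BY started_at DESC   -- newest session gets rn = 1
               ) AS rn
        FROM sessions
    )
    SELECT user_id, started_at
    FROM ranked
    WHERE rn = 2
    ORDER BY user_id;
"""

second_sessions = conn.execute(SECOND_MOST_RECENT).fetchall()
print(second_sessions)
```

Users with a single session simply drop out of the result, which is worth saying aloud in the interview since the prompt does not specify what to return for them.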

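Running totals and deltas. “Show me 7-day rolling revenue per seller.” This requires SUM() OVER (PARTITION BY seller_id ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW). The ROWS frame counts rows, not calendar days, so it is only a true 7-day window when the table has exactly one row per seller per day; aggregate to daily totals and fill missing dates first, or call out the assumption.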
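The rolling-sum frame can be sketched as follows, again via sqlite3. The daily_revenue table and its values are hypothetical, and the example assumes one row per seller per day with no gaps, so a 7-row frame lines up with 7 calendar days.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_revenue (seller_id INTEGER, sale_date TEXT, revenue REAL)"
)

# Seller 1: nine consecutive days with revenue 10, 20, ..., 90.
start = date(2024, 1, 1)
rows = [(1, (start + timedelta(days=i)).isoformat(), 10.0 * (i + 1)) for i in range(9)]
# Seller 2: two days, to show that the window restarts per seller.
rows += [(2, "2024-01-01", 5.0), (2, "2024-01-02", 7.0)]
conn.executemany("INSERT INTO daily_revenue VALUES (?, ?, ?)", rows)

ROLLING_7D = """
    SELECT seller_id,
           sale_date,
           SUM(revenue) OVER (
               PARTITION BY seller_id
               ORDER BY sale_date
               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW  -- current row + 6 before it
           ) AS revenue_7d
    FROM daily_revenue
    ORDER BY seller_id, sale_date;
"""

rolling = conn.execute(ROLLING_7D).fetchall()
for row in rolling:
    print(row)
```

For the first six days of each seller the frame holds fewer than seven rows, so the value is a partial sum; whether to show or suppress those rows is a good clarifying question to ask.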

Self-joins or CTEs for gap/island problems. “Find customers who placed orders on two consecutive days.” Classic self-join or LAG/LEAD approach.
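One way to write the LAG version, sketched with sqlite3. The orders table and its rows are invented for the example, and the date arithmetic uses SQLite's date() function; other engines spell that step differently (DATEADD, INTERVAL, and so on).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, order_date TEXT);
    INSERT INTO orders VALUES
        (1, '2024-01-01'), (1, '2024-01-02'),                     -- consecutive
        (2, '2024-01-01'), (2, '2024-01-03'),                     -- one-day gap
        (3, '2024-01-05'), (3, '2024-01-05'), (3, '2024-01-06'),  -- duplicate day
        (4, '2024-01-31'), (4, '2024-02-01');                     -- month boundary
""")

CONSECUTIVE_DAYS = """
    WITH order_days AS (
        -- Collapse multiple orders on the same day into one row.
        SELECT DISTINCT customer_id, order_date FROM orders
    ),
    with_prev AS (
        SELECT customer_id,
               order_date,
               LAG(order_date) OVER (
                   PARTITION BY customer_id ORDER BY order_date
               ) AS prev_date
        FROM order_days
    )
    SELECT DISTINCT customer_id
    FROM with_prev
    WHERE order_date = date(prev_date, '+1 day')
    ORDER BY customer_id;
"""

consecutive = conn.execute(CONSECUTIVE_DAYS).fetchall()
print(consecutive)
```

De-duplicating to one row per customer per day before applying LAG matters: without it, two orders on the same day would sit between the consecutive dates and hide the match.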

One real question reported by candidates: “Write a query to find the top 3 selling products in each category for the past 30 days, excluding products with fewer than 100 units sold.” This involves a CTE that aggregates units per product over the date window, a HAVING clause (or a WHERE on the CTE’s output) to drop products under 100 units, and RANK() partitioned by category, filtered to rank <= 3.
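A sketch of one possible answer, executed through sqlite3. The sales table, its columns, and the as_of reference date are assumptions made so the example is reproducible; in a live interview you would typically use the engine's current-date function instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (product_id TEXT, category TEXT, sale_date TEXT, units INTEGER);
    INSERT INTO sales VALUES
        ('A', 'books', '2024-02-10', 300), ('A', 'books', '2024-02-20', 200),
        ('B', 'books', '2024-02-15', 300),
        ('C', 'books', '2024-02-15', 200),
        ('D', 'books', '2024-02-15', 150),
        ('E', 'books', '2024-02-15', 50),     -- under 100 units
        ('F', 'books', '2023-12-15', 1000),   -- outside the 30-day window
        ('G', 'toys',  '2024-02-15', 120),
        ('H', 'toys',  '2024-02-15', 90);     -- under 100 units
""")

TOP_3_PER_CATEGORY = """
    WITH product_sales AS (
        SELECT category, product_id, SUM(units) AS units_sold
        FROM sales
        WHERE sale_date >= date(:as_of, '-30 days')   -- row-level filter
        GROUP BY category, product_id
        HAVING SUM(units) >= 100                      -- filter on the aggregate
    ),
    ranked AS (
        SELECT category, product_id, units_sold,
               RANK() OVER (
                   PARTITION BY category ORDER BY units_sold DESC
               ) AS rnk
        FROM product_sales
    )
    SELECT category, product_id, units_sold, rnk
    FROM ranked
    WHERE rnk <= 3
    ORDER BY category, rnk;
"""

top_products = conn.execute(TOP_3_PER_CATEGORY, {"as_of": "2024-03-01"}).fetchall()
for row in top_products:
    print(row)
```

RANK() can return more than three rows per category when products tie on units sold; if the interviewer wants exactly three, ROW_NUMBER() with a tie-breaking column is the usual alternative.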

Statistics and experiment design round

A/B testing is Amazon’s core decision-making tool — the company runs thousands of experiments per year across its marketplace, Prime, and advertising products. Interviewers expect you to speak to experimentation the way a practitioner does, not as a textbook exercise.

Common questions:

  • “How do you determine sample size before running an A/B test?” Walk through the inputs: baseline conversion rate, minimum detectable effect, desired power (typically 80%), and significance level (α = 0.05). Demonstrate that you understand the tradeoff — smaller MDE requires larger samples and longer runtime.
  • “You launch a test and the sample ratio is 80/20 instead of 50/50. What happened and what do you do?” This is a sample ratio mismatch (SRM). You confirm the imbalance is too large to be chance (a chi-square goodness-of-fit test on the assignment counts), pause the test, investigate the randomization logic, check whether the assignment cookie is being set correctly, and do not ship based on contaminated data.
  • “How would you design an experiment to test a new Prime recommendation algorithm, accounting for the fact that user behavior on day 1 might differ from day 30?” This probes novelty effect awareness and whether you’d run a holdback group over a longer window.
  • “How do you handle interference in a marketplace A/B test?” Amazon’s marketplace has supply-side and demand-side interactions — showing product X to the treatment group can affect inventory availability for the control group. You’d discuss graph-cluster randomization or synthetic control approaches.
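Two of the questions above reduce to short calculations, sketched here with only the Python standard library. The function names are illustrative. The sample-size formula is the common normal-approximation for comparing two proportions, not a statement of how Amazon's internal tooling computes it, and the 0.001 cutoff for flagging an SRM is a widely used convention rather than a fixed rule.

```python
from math import ceil, erfc, sqrt
from statistics import NormalDist


def sample_size_per_arm(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate users needed per arm for a two-sided two-proportion test.

    baseline: control conversion rate, e.g. 0.10
    mde_abs:  minimum detectable effect as an absolute lift, e.g. 0.01
    """
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)


def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the split."""
    total = n_control + n_treatment
    exp_treatment = total * expected_treatment_share
    exp_control = total - exp_treatment
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treatment) ** 2 / exp_treatment)
    # With 1 degree of freedom, the chi-square tail equals erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))


print(sample_size_per_arm(0.10, 0.01))    # 1-point lift on a 10% baseline
print(sample_size_per_arm(0.10, 0.005))   # half the MDE, roughly 4x the sample
print(srm_p_value(8000, 2000))            # 80/20 split when 50/50 was intended
print(srm_p_value(5050, 4950))            # small imbalance, consistent with chance
```

The first pair of calls shows the tradeoff named in the sample-size question: because the MDE is squared in the denominator, halving it roughly quadruples the required traffic.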

Sample answer for a hypothesis testing question:

“What is the difference between Type I and Type II error, and when would you prioritize reducing one over the other?”

Type I error is a false positive: concluding an effect exists when it doesn’t. Type II error is a false negative: missing a real effect. In a medical trial context you’d set a strict α (0.01) to minimize false positives. In a low-stakes product test where you want to catch real lifts quickly, you might accept a higher α (say 0.10), which buys more power at the same sample size, or reaches the same power with less traffic, so you can iterate faster. At Amazon, if the cost of a bad launch is high (pricing algorithm, fraud detection), you tighten the significance threshold; for a low-risk UI test, 0.05 with 80% power is standard.
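To make the α versus power tradeoff concrete, here is a small sketch using a normal approximation for a two-sided two-proportion test. The function name, the conversion rates, and the sample size are illustrative assumptions, and the calculation ignores the negligible probability of rejecting in the wrong direction.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p_control, p_treatment, n_per_arm, alpha):
    """Approximate power to detect p_treatment vs p_control at the given alpha."""
    se = sqrt(p_control * (1 - p_control) / n_per_arm
              + p_treatment * (1 - p_treatment) / n_per_arm)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Probability the test statistic lands beyond the critical value
    # when the true difference is p_treatment - p_control.
    return NormalDist().cdf(abs(p_treatment - p_control) / se - z_crit)


# Same experiment (10% vs 11% conversion, 10,000 users per arm),
# evaluated at three significance levels.
for alpha in (0.01, 0.05, 0.10):
    print(alpha, round(approx_power(0.10, 0.11, 10_000, alpha), 3))
```

With traffic held fixed, a stricter α lowers power (more Type II errors) and a looser α raises it (more Type I errors), which is the tradeoff the answer asks you to reason about.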

Machine learning round

Amazon uses ML extensively — recommendation systems, demand forecasting, fraud detection, search ranking, pricing optimization. Interviewers are not looking for you to recite model architectures. They want to see engineering judgment.

Common questions:

  • “Walk me through how you would build a model to predict whether a customer will return a product.” This is a full ML system design question: define the label (return within 30 days?), identify features (product category, price, review score, customer history), address class imbalance, choose a metric (precision vs. recall tradeoff at the threshold that matters for business), describe monitoring post-deployment.
  • “What’s the difference between precision and recall, and how do you choose a threshold?” Precision = true positives / (true positives + false positives). Recall = true positives / (true positives + false negatives). The threshold is a business decision — for fraud detection you’d optimize recall to catch more fraud at the cost of more false alarms; for a promotional email campaign you’d optimize precision to avoid annoying non-targets.
  • “How would you detect model degradation in production?” Monitor feature distributions for drift (for example with the population stability index, PSI), track the prediction distribution over time, set up alerts on business KPIs that the model influences, and run periodic shadow tests against a baseline.
  • “Explain regularization and when you’d use L1 vs. L2.” L1 (Lasso) drives some coefficients to exactly zero, giving you automatic feature selection, which is useful when you have hundreds of features and expect only a few to matter. L2 (Ridge) shrinks all coefficients toward zero without eliminating any, and tends to perform better when many features have small but real effects or are correlated with each other. Elastic Net combines both.
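The precision/recall definitions and the PSI drift check above can be sketched in plain Python. The labels, scores, and bin fractions are toy values invented for illustration, the function names are hypothetical, and the PSI > 0.2 reading used in the comment is a common rule of thumb rather than a formal threshold.

```python
from math import log


def precision_recall(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def psi(expected_fracs, actual_fracs, eps=1e-6):
    """Population stability index over pre-binned fractions of a feature."""
    total = 0.0
    for e, a in zip(expected_fracs, actual_fracs):
        e, a = max(e, eps), max(a, eps)  # guard against log(0) on empty bins
        total += (a - e) * log(a / e)
    return total


y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]

# Raising the threshold trades recall for precision, and vice versa.
for threshold in (0.25, 0.5, 0.7):
    print(threshold, precision_recall(y_true, scores, threshold))

training_bins = [0.25, 0.25, 0.25, 0.25]   # feature distribution at training time
serving_bins = [0.10, 0.20, 0.30, 0.40]    # same bins, measured in production
print(round(psi(training_bins, serving_bins), 3))  # above 0.2 is often read as drift
```

In a fraud setting you would pick the lower threshold (high recall, more false alarms); for the promotional email case you would pick the higher one (high precision, fewer wasted sends).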

Behavioral round: Leadership Principles in practice

Prepare 12–15 STAR stories before your loop. Each story should be specific enough that it can’t be applied to every principle, and should include a concrete result. Principles most frequently cited for data science roles: Dive Deep; Invent and Simplify; Are Right, A Lot; Customer Obsession; Bias for Action; and Ownership.

“Dive Deep” example question: “Tell me about a time you discovered a flaw in data that others had accepted as accurate.”

A strong answer names the data set, describes how you found the issue (anomaly in a distribution, inconsistency across systems, a metric that didn’t add up), what you did to investigate (querying raw logs, talking to the data engineering team, running reconciliation scripts), and what changed as a result — ideally a decision that was reversed or a pipeline that was corrected.

“Invent and Simplify” example question: “Tell me about a time you found a simpler way to solve a complex analytical problem.”

One pattern that lands well: you inherited a sprawling, fragile modeling pipeline and replaced it with a cleaner design — fewer features, a simpler model that performed comparably — reducing compute cost and making the model’s behavior interpretable to stakeholders. Quantify both the technical improvement (e.g., “reduced inference time by 60%”) and the business impact.

Things that sink behavioral answers:

  • Saying “we” throughout without clarifying your individual role.
  • Describing what you would do hypothetically instead of what you actually did.
  • Trailing off without a measurable result.
  • A result that’s just “the project was successful” — interviewers are trained to probe until they get a number.

Level and compensation context

Amazon hires data scientists at L4 through L7 in most business units. Based on data from Levels.fyi and compensation reporting through late 2025:

  • L4 (Data Scientist I): Total compensation starting around $179K. Typical for new grads or candidates with 0–2 years of industry experience.
  • L5 (Data Scientist II): Approximately $225K total. The most common hiring level for candidates with 3–5 years of experience and demonstrated ownership of end-to-end projects.
  • L6 (Senior Data Scientist): Approximately $336K total. Requires evidence of technical depth and organizational influence — you’ve shaped team roadmaps, mentored others, and driven decisions that crossed team boundaries.
  • L7 (Principal Data Scientist): Approximately $567K total. Leadership-track role; candidates typically have 10+ years of experience with a track record of defining strategy at the org level.

Compensation is heavily weighted toward RSUs at L5 and above. Location matters — Seattle and San Francisco bands run higher than other metros. The recruiter will tell you the level before the loop in most cases; if they don’t, ask.

Your six-week prep plan

Weeks 1–2: SQL and stats. Grind 20–30 SQL problems at medium-to-hard difficulty (DataLemur and StrataScratch both have Amazon-tagged questions). Review window functions until you can write PARTITION BY ... ORDER BY ... ROWS BETWEEN without hesitation. For stats, work through hypothesis testing, power analysis, and the core A/B testing failure modes from scratch.

Weeks 3–4: Machine learning and system design. Practice ML system design end-to-end: problem framing → data → features → model choice → evaluation → deployment → monitoring. Pick 3–4 business problems you understand well and rehearse designing an ML solution aloud.

Weeks 5–6: Leadership Principles and mock loops. Write out your 12–15 STAR stories. Map each to at least two Leadership Principles. Do at least two full mock interviews where someone asks you LP questions and pushes back on your results and reasoning. Record yourself once — people are often surprised by how vague their answers sound out loud.

Track everything you’re applying to, every LP story you’ve drafted, and every follow-up task using a system that keeps it all in one place — the loop is long, and the cognitive overhead of managing an active search while prepping is real.

A note on Amazon’s hiring timeline

The decision process after your loop typically takes one to three weeks. Interviewers submit written feedback independently, the debrief happens (sometimes asynchronously over email), and the Bar Raiser facilitates consensus. If you don’t hear within ten business days, follow up with your recruiter; it’s professional and expected. Silence usually means the debrief is ongoing, not that you’ve been rejected.