
[LIVE] Anthropic Distillation & How Models Cheat (SWE-Bench Dead) | Nathan Lambert & Sebastian Raschka

Latent Space · Latent.Space, Nathan Lambert, and Sebastian Raschka, PhD · February 26, 2026

Chapter Summaries

Chapter 1 — Anthropic’s Distillation “Attack” Blog Post

Anthropic published a blog post detailing how they detected Chinese AI labs systematically querying their API at scale to generate synthetic training data — a practice termed “distillation.” The blog specifically named Minimax (which redirected ~half its API traffic when Anthropic released Claude Opus 4.6) and implicated DeepSeek (though at a far smaller scale). Nathan Lambert notes that Anthropic’s framing of this as an “attack” fits the company’s geopolitical branding, and he finds the behavior unsurprising: Chinese labs face severe GPU shortages, and harvesting synthetic data from publicly available APIs is far easier than generating it independently. Sebastian Raschka explains the term: distillation means training a smaller model on the outputs of a larger model — originally done on raw logits, now applied more loosely to any synthetic Q&A data sampled from a frontier model. This is extremely common inside labs (e.g., DeepSeek R1’s smaller variants are trained on outputs of the full 671B model).
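The two senses of “distillation” Raschka distinguishes can be sketched concretely. The snippet below shows the classic logit-based objective (a KL divergence between temperature-softened teacher and student distributions, in the style of Hinton et al.); API-based distillation skips this entirely and just fine-tunes on sampled text, since commercial APIs do not expose logits. This is an illustrative pure-Python sketch, not any lab’s actual training code.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to a probability distribution, optionally softened."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions.

    This is the original logit-level distillation objective. "Distillation"
    via a commercial API instead means supervised fine-tuning on generated
    text, because the teacher's logits are never exposed.
    """
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A student that matches the teacher exactly incurs zero loss...
loss_same = distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1])
# ...while a mismatched student incurs a strictly positive loss.
loss_diff = distillation_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0])
```

The temperature hyperparameter softens the teacher’s distribution so the student also learns from the relative probabilities of wrong answers, which is most of the extra signal logit distillation provides over plain fine-tuning.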

Chapter 2 — How Distillation Is Detected (and Why It’s Hard)

Detecting distillation is tricky: evaluating a model and generating training data look almost identical at the API level. The key signals labs look for are: (1) very high query volume with a repetitive question distribution (benchmark-shaped rather than broad coverage), (2) sudden traffic shifts when a new model is released (Minimax’s traffic pivoted to the new Claude release immediately, making attribution clear), and (3) pattern analysis across linked accounts that spread traffic to circumvent rate limits. The discussion raises a privacy tension: detecting distillation requires monitoring what users are generating, which is uncomfortable territory. The group notes Anthropic has previously blocked OpenAI and xAI from using its API. The practical conclusion: true-scale distillation (100B+ tokens) takes time and is hard to disguise, but light evaluation-scale use (millions of API calls) is nearly impossible to distinguish from legitimate benchmark runs.
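Signal (1) — high volume over a narrow, repetitive query distribution — can be sketched as a toy entropy heuristic. The thresholds and function names below are invented for illustration; this is not Anthropic’s actual detection system, just a demonstration of why benchmark-shaped traffic is statistically distinguishable from organic usage.

```python
import math
from collections import Counter

def query_entropy(queries):
    """Shannon entropy (bits) of the observed query distribution.

    Harvesting traffic cycles through a small, fixed prompt set, so its
    entropy stays low no matter the volume; organic traffic from many
    users is high-entropy.
    """
    counts = Counter(queries)
    n = len(queries)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def looks_like_harvesting(queries, min_volume=1_000, max_entropy_bits=8.0):
    """Crude flag: high volume plus a narrow distribution (hypothetical thresholds)."""
    return len(queries) >= min_volume and query_entropy(queries) <= max_entropy_bits

# 10,000 calls cycling through the same 100 benchmark prompts: ~6.6 bits, flagged.
benchmark_traffic = [f"task-{i % 100}" for i in range(10_000)]
# 10,000 distinct organic queries: ~13.3 bits, not flagged.
organic_traffic = [f"user-query-{i}" for i in range(10_000)]
```

The same math also explains the hosts’ caveat: at evaluation scale (a few passes over a benchmark), the entropy signature of a legitimate eval run and a small distillation run are identical, so volume is doing most of the discriminative work.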

Chapter 3 — SWE-Bench Verified Is Dead

SWE-Bench (a coding benchmark built from real GitHub issues in popular open-source repos) became the dominant agentic coding leaderboard, going from ~13% at launch to 80%+ across all frontier models. OpenAI’s curation of SWE-Bench Verified (a human-vetted 500-task subset, costing several million dollars) was supposed to fix quality issues. But OpenAI’s own audit revealed: (1) ~59% of the remaining ~20% of hard cases are literally unsolvable — the tasks were underspecified or broken, (2) GPT-5 was caught using knowledge from future versions of libraries to solve problems (because the problems come from open-source repos, and models trained on GitHub absorb all later versions of those APIs), and (3) probing of Gemini and Claude Opus showed the models could regenerate an entire problem statement from just its task ID — confirming near-perfect memorization of the test set. The benchmark’s saturation (~80% for nearly every frontier model, including smaller/cheaper ones that should clearly be worse) confirms that variance from memorization and cheating has swamped the signal from actual capability.
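The memorization finding in point (3) is easy to turn into a reusable probe: prompt a model with only a task ID and measure how much of the real problem statement it reproduces. The sketch below is a hypothetical harness — the prompt wording, function names, and toy models are illustrative, not the probing method the hosts describe being used.

```python
def memorization_probe(model, task_id, reference_statement):
    """Ask a model to reproduce a benchmark task from its ID alone, then
    score word overlap with the true problem statement.

    `model` is any callable mapping a prompt string to generated text.
    A score near 1.0 suggests the test set was absorbed during pre-training.
    """
    prompt = f"State the full problem description for benchmark task {task_id}."
    output = model(prompt)
    ref_words = set(reference_statement.lower().split())
    out_words = set(output.lower().split())
    return len(ref_words & out_words) / len(ref_words) if ref_words else 0.0

# Toy "contaminated" model with the statement memorized verbatim:
STATEMENT = "fix the off-by-one error in the pagination helper"
leaky_model = lambda prompt: STATEMENT
# Toy clean model that produces only generic text:
clean_model = lambda prompt: "i cannot recall that specific benchmark item"

leaky_score = memorization_probe(leaky_model, "repo__project-12907", STATEMENT)
clean_score = memorization_probe(clean_model, "repo__project-12907", STATEMENT)
```

In practice one would use a fuzzier similarity metric (n-gram overlap or edit distance) and many task IDs, but even this crude set-overlap version separates verbatim recall from generic refusal.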

Chapter 4 — SWE-Bench Pro and the Future of Benchmarking

SWE-Bench Pro attempts to fix SWE-Bench Verified with updated time ranges (pulling from more recent GitHub issues), diversified repositories and languages, and private/public data splits. However, the hosts are skeptical that any new benchmark avoids these problems permanently — the recurring pattern is that careful initial validation still misses issues that are only exposed years later. Scale AI also has a commercial interest in making SWE-Bench Pro a strong benchmark. The group agrees: coding benchmarks will remain relatively tractable because code has objective test-pass/fail evaluation, but benchmarking computer use, UI tasks, and multimodal agentic workflows (the next frontier) will be far harder and will likely require “GDP-eval”-style macroeconomic measurement. The smallest increment on a 500-task benchmark is 0.2%, so sub-0.2% improvement claims are largely noise.
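The 0.2% noise-floor claim follows directly from the benchmark’s size: one task out of 500 is 0.2 percentage points, so no smaller change is even expressible. The binomial sampling noise around a ~80% pass rate is in fact considerably larger, as this quick calculation shows:

```python
import math

def benchmark_resolution(n_tasks):
    """Smallest score change a pass/fail benchmark can express: one task."""
    return 1 / n_tasks

def score_std_error(pass_rate, n_tasks):
    """Binomial standard error of a reported pass rate: sqrt(p(1-p)/n)."""
    return math.sqrt(pass_rate * (1 - pass_rate) / n_tasks)

# SWE-Bench Verified: 500 tasks, frontier scores clustered around ~80%.
step = benchmark_resolution(500)      # 0.002 -> 0.2 percentage points
stderr = score_std_error(0.80, 500)   # ~0.018 -> ~1.8 percentage points
```

So models whose reported scores sit within a couple of percentage points of each other are statistically indistinguishable on this benchmark even before accounting for contamination — which is why the hosts treat the clustered ~80% leaderboard as noise rather than ranking.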


Summary

This live session covers two intertwined topics dominating the AI engineering community: the distillation/API scraping controversy and the collapse of SWE-Bench as a reliable benchmark.

Key themes and actionable insights:

Distillation is happening, is nearly undetectable at small scale, and is inevitable. Every serious AI lab has economic incentive to generate synthetic data from frontier models. The practical effect: Anthropic’s public blog post is more geopolitical marketing than an enforceable legal position. For engineers building on top of LLM APIs, the key takeaway is that terms of service prohibiting training competitors on API outputs are only meaningfully enforceable against flagrant, large-scale, detectable abuse — not against routine synthetic data generation. Tools like OpenRouter (mentioned by Raschka as his preferred routing service) provide API access to open-weight models (DeepSeek, etc.) that have permissive terms, making large-scale distillation accessible and legally safer for practitioners.

SWE-Bench Verified is no longer a reliable signal for comparing frontier models. The 80%-saturation problem means nearly all frontier models cluster within noise, including models that clearly perform differently in real use. For practitioners choosing LLMs for coding tasks: don’t rely on SWE-Bench Verified scores to differentiate frontier models. Use real task performance on your actual codebase, or wait for SWE-Bench Pro’s private leaderboard to mature. The fact that a model can be “prompted” to spit out an entire benchmark task from its task ID (pure memorization) should make practitioners suspicious of any coding benchmark number until private evaluation becomes the standard.

Benchmark cheating is a structural problem, not a moral failure. The models don’t intentionally cheat — they absorb public GitHub code during pre-training (including unit tests that are the benchmark’s answer key), and there’s no clean way to prevent this without fully private, never-published evaluation sets maintained by neutral parties (like Scale AI’s private set for SWE-Bench Pro). For teams building internal evaluations: this is a strong argument for keeping your internal test suite private and never publishing it, and for using canary tasks (intentionally unsolvable or memorization-detectable tasks) to catch model overfitting.
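The canary-task suggestion above can be sketched as a small audit step in an internal eval harness. Everything here — the function name, the result format, the report keys — is a hypothetical illustration of the idea, not an existing tool:

```python
def audit_with_canaries(results, canary_ids):
    """Split eval results into real tasks and canaries, and flag contamination.

    Canaries are tasks designed to be unsolvable (or solvable only by
    memorization), so any "pass" on one signals contamination or a broken
    harness rather than capability. `results` maps task ID -> passed (bool).
    """
    solved_canaries = [tid for tid, passed in results.items()
                       if passed and tid in canary_ids]
    real = {tid: p for tid, p in results.items() if tid not in canary_ids}
    honest_rate = sum(real.values()) / len(real) if real else 0.0
    return {
        "contaminated": bool(solved_canaries),  # any solved canary is a red flag
        "solved_canaries": solved_canaries,
        "honest_pass_rate": honest_rate,        # score over real tasks only
    }

# Example: a model that "solves" a canary gets flagged, and its headline
# score is recomputed over the legitimate tasks alone.
report = audit_with_canaries(
    {"task-1": True, "task-2": False, "canary-1": True},
    canary_ids={"canary-1"},
)
```

Keeping the canary IDs indistinguishable from real task IDs is the important design choice: if a model (or a motivated submitter) can tell which tasks are canaries, the tripwire stops working.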


Career/technical advice: Nathan Lambert highlighted that evaluations (“evals”) at the frontier are becoming structurally expensive — from millions to tens/hundreds of millions of dollars for frontier-grade evals. This creates a meaningful commercial opportunity for trusted third-party evaluation providers. Engineers who can build rigorous, private, manipulation-resistant evaluation pipelines for non-coding domains (UI, computer use, multimodal) are positioned at a high-value gap in the current ecosystem.