20VC: Scale's Alex Wang on Why Data, Not Compute, Is the Bottleneck to Foundation Model Performance, Why AI Is the Greatest Military Asset Ever, Is China Really Two Years Behind the US in AI, and Why the CCP's Industrial Approach Is Better than Anyone Else's

20VC · Harry Stebbings — Alex Wang · June 12, 2024

Most important takeaway

Compute is no longer the binding constraint on foundation model progress — data is. The next leap from GPT-4 to GPT-10 will come from “frontier data” (complex reasoning chains, agent traces, tool use, expert thinking that was never written down on the internet), and the labs and nations that build the means of producing that data will win both economically and geopolitically.

Summary

Alex Wang’s central argument is that AI progress rests on three pillars — compute, algorithms, and data — and that the industry has scaled compute exponentially since GPT-4 without scaling the other two, producing the visible plateau in model quality. The internet has been fully mined; what remains is “frontier data” — the reasoning, deliberation, and tool-use traces that experts never write down. Producing it requires a hybrid human + synthetic pipeline modeled on autonomous-vehicle safety drivers, where AI generates and humans correct.
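The hybrid pipeline described above — AI drafts, a human expert intervenes only when the draft is wrong, analogous to an AV safety driver — can be sketched as follows. This is an illustrative outline, not Scale's actual system; every function and type name here is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Trace:
    """One unit of frontier data: a prompt plus a reasoning/tool-use trace."""
    prompt: str
    reasoning: str                    # AI-generated draft trace
    corrected: Optional[str] = None   # expert's fix, if the draft needed one

def frontier_data_pipeline(
    prompts: list[str],
    generate: Callable[[str], str],                       # hypothetical model call
    expert_review: Callable[[str, str], Optional[str]],   # returns a correction or None
) -> list[Trace]:
    """Safety-driver pattern: the model 'drives' (generates every trace);
    the expert only takes the wheel when the output is wrong."""
    dataset = []
    for p in prompts:
        draft = generate(p)
        fix = expert_review(p, draft)   # None means the expert approved the draft
        dataset.append(Trace(prompt=p, reasoning=draft, corrected=fix))
    return dataset

def to_training_examples(traces: list[Trace]) -> list[tuple[str, str]]:
    """Prefer the expert correction when one exists; otherwise keep the draft."""
    return [(t.prompt, t.corrected or t.reasoning) for t in traces]
```

The design point is leverage: the expert's time is spent only on the fraction of traces that need correction, so one domain expert can supervise far more data than they could author from scratch.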

Actionable insights and patterns:

  • Data, not compute, will be the durable moat for foundation labs. Expect labs to brag about data rights the way they currently brag about GPU counts. Anthropic-style enterprise data strategies and OpenAI’s FT / Axel Springer deals are early indicators.
  • Two parallel data motions matter: (1) one-time mining of existing enterprise data (JP Morgan alone has 150 PB vs. GPT-4’s <1 PB training set) and (2) ongoing forward production of frontier data via expert-in-the-loop pipelines.
  • Reasoning is largely a data problem in disguise — for any domain you want a model to reason in, you need dense data covering that scenario. General reasoning breakthroughs are a separate, longer bet.
  • Enterprises will increasingly demand on-prem / open-weight models (Llama, Mistral) because their proprietary data is their only AI-era differentiator; sharing it with a model vendor risks “mortgaging their future.” Expect a real on-prem reversion.
  • Software is shifting from walled-garden SaaS to a “constellation” of custom, purpose-built apps (Palantir-style, but feasible for everyone because gen-AI collapses build costs). Per-seat pricing breaks when agents do the work — move to consumption pricing.
  • Software engineering careers: rote coding compresses; the durable skill is translating customer/business problems into well-scoped engineering tickets an AI engineer can execute. Position yourself at that translation layer.
  • High-leverage career path for technical experts (mathematicians, doctors, scientists, senior engineers): contributing frontier data to models multiplies your expertise across every future model call. Wang frames “AI trainer / contributor” as one of the highest-leverage jobs available.
  • Value capture in AI shifts constantly (Wang recommends Andy Grove's High Output Management). The model layer itself may capture little; infrastructure below and applications/services above will capture more. AI services revenue will likely exceed AI model revenue over the next five years (cf. Accenture).
  • Geopolitics: China is roughly neck-and-neck (e.g., 01.AI’s Yi-Large), not two years behind. The CCP excels at “turning the crank” on industrial policy (solar, EVs) and a permissive data regime could let them race ahead. The West needs a pro-data regulatory stance — pooled industry data (aerospace safety, financial-fraud, HIPAA-bounded medical) without weakening liberal-democracy values.
  • Open vs. closed models: a dichotomy is inevitable — frontier systems stay closed for national-security reasons (AGI as a greater military asset than nukes); sub-frontier open models like Llama 3 are fine and economically valuable.
  • Foundation-model layer will consolidate to nation-states and hyperscalers because per-model costs are heading to tens or hundreds of billions. Watch how the OpenAI/Microsoft and Anthropic/Amazon partnerships actually play out.

Company-building principles from Wang:

  • “Best PR is no PR.” Traditional press optimizes for clicks (build up, tear down). Build direct distribution (podcasts, owned channels). Personalities scale better than company brands — people follow Sam Altman more than OpenAI; invest in founder voice.
  • Hire people who “give a shit” — A-players who sweat every detail. Aim for the Navy SEALs, not the Navy. Wang still personally reviews every hire at ~800 people and overrides hiring-manager recommendations 25–30% of the time.
  • Biggest leadership mistake: conflating company hyper-growth with team hyper-growth. Scale went 150 → 700+ in 2020–2022 and the bar quietly eroded. Since end of 2022 they’ve held headcount at ~800 while revenue grew dramatically — a deliberate density play (Airbnb / Chesky pattern).
  • Brand heat is cyclical; build a self-preserving talent ecosystem so quality doesn’t depend on whether you’re currently “hot.” Best hires often come during the cold periods.
  • Manage from freedom (trust + autonomy) rather than fear; identify which mode each report responds to.
  • Watch the AV-cycle analog for gen-AI: over-promising relative to technical reality created the AV trough. Same risk now — make measured public promises.

Chapter Summaries

  • Diminishing returns and the data wall: Despite Nvidia data-center revenue going from ~$5B to >$20B per quarter, no model has materially surpassed GPT-4. Cause: pre-training has consumed the easy internet data; compute is scaling without data scaling.
  • What frontier data is and why it matters: Complex reasoning chains, agent traces, tool use, expert deliberation. Most economically valuable thinking never gets written on the internet, so models can’t learn it from crawls.
  • Where new data comes from: (1) enterprise mining (one-time but huge — 150 PB at JP Morgan), (2) longitudinal capture (process mining at work, devices like Limitless / Meta Ray-Bans for consumers), (3) expert + AI hybrid frontier-data production.
  • Data as the durable moat: Among compute/algorithms/data, only data offers sustainable advantage. Labs will increasingly differentiate via data strategies (OpenAI–FT/Axel Springer, Anthropic on enterprise data).
  • On-prem reversion: Enterprises see their proprietary data as their only AI-era moat; open-weight models on-prem will see strong demand.
  • Value capture across the stack: Model layer commoditizes; infra (Nvidia) and apps/services capture value. End-of-software thesis — constellation of custom apps replacing walled-garden SaaS.
  • Engineering teams and pricing: Engineers shift toward problem-translation work; per-seat pricing dies in favor of consumption pricing as agents do the work.
  • Regulation and geopolitics: EU is too restrictive; US lacks a pro-data stance. Pool safety / fraud / medical data with anonymization. China is neck-and-neck and excels at industrial scale-up.
  • AI as military asset: Potentially greater than nukes — informs the open/closed dichotomy. Frontier systems must stay closed; sub-frontier open-source is fine.
  • Foundation-model consolidation: Only nation-states and hyperscalers can underwrite future training runs.
  • Company-building: No-PR strategy, founder-led narrative, hiring “Navy SEALs,” 25–30% override rate on hires, divorcing team growth from revenue growth.
  • Lightning round: Biggest mind-change is on hyper-growth/team density; biggest AI misconception is “compute is all you need”; dream board member is Satya Nadella; cautionary tale is the AV hype cycle.