The PhD students who became the judges of the AI industry

Equity · Rebecca Bellan, with Anastasios Angelopoulos and Wei-Lin Chiang · March 18, 2026 · Original

Most important takeaway

Arena (formerly LM Arena) has become the de facto standard for evaluating AI models by using real-world user interactions rather than static benchmarks, making it nearly impossible for models to “overfit” to the test. Their structural neutrality — where scores are determined entirely by user votes processed through an open-source pipeline — has earned trust from major AI labs and investors alike, propelling the company to a $1.7 billion valuation in just seven months.

Chapter Summaries

Introduction and Founder Backgrounds: Rebecca Bellan introduces Arena co-founders Anastasios Angelopoulos (CEO) and Wei-Lin Chiang (CTO). Both were PhD students at UC Berkeley who started the project in early 2023, after ChatGPT launched, as a research effort to evaluate LLMs in the real world, building a platform where users compare anonymized model responses.

How Arena Differs from Static Benchmarks: Anastasios explains the core differentiator. Static benchmarks (like Humanity’s Last Exam) suffer from overfitting once models memorize the questions, whereas Arena receives hundreds of thousands of fresh conversations daily from tens of millions of users, making overfitting effectively impossible. About 28% of users come for coding, and the platform also covers legal, medical, academic, and creative tasks.

Reproducibility and Statistical Rigor: The team addresses concerns about reproducibility. While individual user prompts vary, the leaderboard is a statistical estimator with confidence intervals that converges to reliable results at scale, similar to A/B testing methodology.

Neutrality and Preventing Gaming: Arena requires that any model submitted for public scoring be the exact model released to production. The public leaderboard never involves money: companies cannot pay for placement or removal. Dedicated teams monitor voting patterns, mouse clicks, and user reputation to detect bots and biased voting.

Enterprise Product and Data Moat: Arena is expanding into paid enterprise tooling, helping companies evaluate which AI models best suit their specific use cases (e.g., legal work). With 5 million+ monthly users across 150+ countries and 60 million conversations per month, its data moat and network effects create a significant competitive advantage.

Expanding Beyond Chat to Agents: The platform now evaluates agentic AI capabilities including web application building, multi-language coding, presentations, image/video editing, deep research, and shopping agents. They launched “Coder Arena” for coding agent evaluation and plan to expand to more agentic use cases.

Style Control and Responsible Benchmarking: Arena developed a “style control” methodology to factor out superficial qualities like response length and markdown formatting, ensuring the leaderboard measures genuine utility rather than just “vibes.” They regularly open-source their human preference data on Hugging Face.

Summary

  • Actionable insight for AI product builders: If you are selecting which AI model to integrate into your product, use Arena’s category-specific leaderboards rather than relying on static benchmarks or headline rankings. The platform breaks down performance by occupation, domain (legal, medical, coding), and task type, so you can find the model that actually performs best for your specific use case.

  • Career advice: The founders’ trajectory — from PhD students building a research prototype to running a $1.7B company in under three years — illustrates how identifying a genuine infrastructure gap in a fast-moving industry can create enormous opportunity. They built credibility through open-source contributions and academic rigor before commercializing.

  • Investment signal: Arena raised a $100M seed followed by a $150M Series A within seven months, backed by A16Z, Kleiner Perkins, Lightspeed, and notably by the very AI labs they evaluate (OpenAI, Meta, Anthropic, Google). This signals that AI evaluation infrastructure is now a venture-scale category. The company’s moat comes from network effects — more users produce better data, which produces more trusted rankings, which attracts more users.

  • Actionable insight for enterprises: Arena is building analytical tools for enterprise customers to evaluate models against their specific workflows. Rather than running expensive internal evaluations, companies can leverage Arena’s existing dataset of real-world interactions to make faster model selection and upgrade decisions as new models are released every few weeks.

  • Key model insight: Anthropic’s Claude is currently leading Arena’s expert-domain leaderboard (legal, medicine, business), where 5-6% of Arena’s users are verified domain experts. This is worth noting for anyone selecting models for professional/specialized applications.

  • Style control matters: Arena’s default leaderboard now uses “style control” to strip out superficial factors like verbose responses and heavy markdown formatting. When evaluating models, look at the style-controlled scores for a more accurate picture of genuine capability rather than presentation quality.