Which LLM is really the best?

Test them on any task in WebDev Arena

Tereza Tizkova
4 min readDec 11, 2024

We have all seen many benchmarks comparing LLMs. If you’re an AI enthusiast, you have probably come across something like this:

…or this:

But how often do you actually get to see them perform in comparison?

Try WebDev Arena

WebDev Arena is an arena where two LLMs compete to build a web app. It’s free and very simple:

  • You write a task (e.g., “Build a productivity tracker” or “Build a photographer portfolio website”)
  • Two randomly chosen LLMs complete the task and show you the result
  • Without knowing which LLMs were competing, you voted for whatever result you liked better
  • You are revealed which LLM created which result.

This AI battlefield really helped me finally see the models in comparison. It helped me verify what I had been thinking for a while — for example, that Claude Sonnet is really great. Or that Qwen Coder is surprisingly good at coding-heavy tasks.

How it works under the hood

WebDev arena is built with Next.js, Tailwind CSS, and AI SDK by Vercel. But most interestingly, it uses E2B to make the outputs from the LLMs actionable.

When the LLM is given a task (e.g. “Build AI data Analyst app”), it starts planning it and then generating the code needed for that. But how is the code transformed into the result? That’s where E2B comes. The code is automatically executed in the “E2B Sandbox” — an isolated cloud environment.

That way we can immediately see the result. The E2B Sandbox is a small isolated VM that can be started very quickly (~150ms). You can think of it as if the LLM had its own small computer in the cloud.

Moreover, you can run many sandboxes at once, in this case, a separate sandbox for each task.

Models competing

Right now, the WebDev Arena offers four different models competing against each other. Each time two of them are randomly chosen.

My results

There are no official results as far as I know, but I made my own little experiment.

See my Excel sheet for the latest version.

I ran different prompts many times. Each LLM competing has its own row indicating how many times it won against LLM in each column. (And opposite, each column shows how many times the given LLM lost against LLM in each row.)

My experiment shows Claude as the clear winner.

Prompts I tested:

  • Create a copy of the Linear app
  • Create a LinkedIn profile of the cringiest AI influencer ever
  • Visualize an approximation of pi
  • Create calculator
  • Create a drawing program
  • Create a flight booking app
  • Build AI Data Analyst app
  • Build evals platform for AI agents
  • Design a dating app for programming languages
  • Build a personal finance tracker with budget visualization
  • Build an interactive periodic table
  • Design a galaxy exploration visualization
  • Create a task management system with a Kanban board

Selection of results:

What do you think?

If you want to join the discussion or stay under for more content on this topic, here is my X and LinkedIn profile.

Follow also E2B on X and LinkedIn.

--

--

Tereza Tizkova
Tereza Tizkova

Written by Tereza Tizkova

My blog is moving to ⚡️ https://terezatizkova.substack.com/ ⚡️ ✴️ Hiring engineers: e2b.dev/careers ✴️

Responses (2)