
SWE-bench
This collection of roughly 2,300 software engineering tasks evaluates how well a model solves real programming problems. The developers built it from GitHub issues and their corresponding pull requests in 12 popular Python repositories. After some limitations appeared, follow-up variants arrived, including SWE-Bench+, SWE-bench Verified, and SWE-Bench Pro.
LMSYS Chatbot Arena
Rather than relying on a fixed set of test prompts, the Large Model Systems Organization’s Chatbot Arena feeds the same prompt to two different models and asks humans to pick the better response. These head-to-head contests produce an Elo-like rating, similar to the one used to rank chess players.
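To make the rating idea concrete, here is a minimal sketch of the classic chess-style Elo update applied to pairwise votes. It is an illustration only, not Chatbot Arena's actual scoring code: the model names, the starting rating of 1000, and the K-factor of 32 are all assumptions chosen for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one head-to-head vote.

    score_a is 1.0 if the human preferred A, 0.0 if B, and 0.5 for a tie.
    k (assumed to be 32 here) controls how far a single vote moves the ratings.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical models and votes, invented purely for illustration.
ratings = {"model_x": 1000.0, "model_y": 1000.0}
votes = [
    ("model_x", "model_y", 1.0),  # voter preferred model_x
    ("model_x", "model_y", 1.0),  # voter preferred model_x
    ("model_x", "model_y", 0.0),  # voter preferred model_y
]

for a, b, score_a in votes:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)

for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

An upset win against a higher-rated model moves the ratings more than an expected win, which is why a relatively small number of votes can sort models into a stable order.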
Price
The rest of these metrics are useful, but as real estate agents say, the three most important numbers on a property listing are price, price, and price. Cost is a bit less important for measuring AIs, but only a bit. Price can make the difference between a profitable project and a money sink. When the cost of each inference is even a tad higher than the revenue it brings in, it’s impossible to make it up with volume.
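The arithmetic behind that last point can be sketched in a few lines. Every figure below (the revenue per request, the token count, and the per-million-token prices) is invented for illustration and does not reflect any real model's pricing.

```python
def monthly_margin(
    revenue_per_request: float,
    tokens_per_request: int,
    cost_per_million_tokens: float,
    requests_per_month: int,
) -> float:
    """Profit (or loss) per month given a flat per-request revenue."""
    cost_per_request = tokens_per_request * cost_per_million_tokens / 1_000_000
    return (revenue_per_request - cost_per_request) * requests_per_month


# Hypothetical scenario: each request earns $0.01 and uses 2,000 tokens.
REVENUE = 0.01
TOKENS = 2_000

for volume in (100_000, 1_000_000):
    cheap = monthly_margin(REVENUE, TOKENS, 3.0, volume)    # assumed $3 per million tokens
    pricey = monthly_margin(REVENUE, TOKENS, 15.0, volume)  # assumed $15 per million tokens
    print(f"{volume:>9,} requests: cheap model {cheap:+,.0f}, pricey model {pricey:+,.0f}")
```

With these made-up numbers, the cheaper model clears a small profit on every request, while the pricier one loses money on each call, so ten times the traffic simply means ten times the loss.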

