
LLM Benchmarks

Performance metrics for leading large language models across standardised benchmarks.

import polars as pl
import polarise  # provides the .style() API used in the examples below
from polarise.datasets import get_llm_benchmarks

df = get_llm_benchmarks()  # sample benchmark table bundled with polarise
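
If get_llm_benchmarks is not available in your install, an equivalent frame is easy to build by hand with plain polars; the values below are copied from the rendered tables on this page:

# Hand-built stand-in for get_llm_benchmarks(), using the values
# shown in the rendered tables below.
df = pl.DataFrame(
    {
        "Model": ["GPT-4", "Claude-3.5 Sonnet", "Gemini-1.5 Pro", "Llama-3-70B", "Mixtral-8x7B"],
        "MMLU": [86.4, 88.7, 85.9, 82.0, 70.6],
        "HumanEval": [67.0, 92.0, 71.9, 81.7, 40.2],
        "GPQA": [50.1, 59.4, 46.2, 42.7, 34.3],
        "Context_k": [128, 200, 2048, 8, 32],
        "Cost_per_1M": [30.0, 3.0, 7.0, 0.9, 0.7],
    }
)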

Multi-column gradient

cmap="imola" (cmcrameri)

(df.style()
   .gradient(["MMLU", "HumanEval", "GPQA"], cmap="imola")
   .fashion_minimal()
   .title("LLM Benchmark Scores")
   .show()
 )
LLM Benchmark Scores
Model              MMLU  HumanEval  GPQA  Context_k  Cost_per_1M
GPT-4              86.4       67.0  50.1        128         30.0
Claude-3.5 Sonnet  88.7       92.0  59.4        200          3.0
Gemini-1.5 Pro     85.9       71.9  46.2       2048          7.0
Llama-3-70B        82.0       81.7  42.7          8          0.9
Mixtral-8x7B       70.6       40.2  34.3         32          0.7
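
"imola" is one of Fabio Crameri's Scientific colour maps, which the cmcrameri package exposes as ordinary matplotlib colormaps. A quick way to confirm the palette is available (assuming cmcrameri is installed):

# Sanity check: the palette should import as a matplotlib Colormap.
from cmcrameri import cm

imola = cm.imola
print(type(imola), imola.N)  # Colormap object and its number of colour levels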

Highlight best in class

(df.style()
   .highlight_max("MMLU")
   .highlight_max("HumanEval", fill="lightblue")
   .highlight_min("Cost_per_1M", fill="lightgreen")
   .fashion_grid()
   .title("Best in Class")
   .footnote("Cost per 1M tokens. Lower is better.")
   .show()
 )
Best in Class
Model              MMLU  HumanEval  GPQA  Context_k  Cost_per_1M
GPT-4              86.4       67.0  50.1        128         30.0
Claude-3.5 Sonnet  88.7       92.0  59.4        200          3.0
Gemini-1.5 Pro     85.9       71.9  46.2       2048          7.0
Llama-3-70B        82.0       81.7  42.7          8          0.9
Mixtral-8x7B       70.6       40.2  34.3         32          0.7
Cost per 1M tokens. Lower is better.
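
If you want the winners as data rather than highlighted cells, the same extremes can be pulled out with plain polars (using the df loaded above, assumed to be an ordinary polars DataFrame):

# Which model leads each highlighted benchmark, and which is cheapest.
best = df.select(
    pl.col("Model").sort_by("MMLU").last().alias("best_MMLU"),
    pl.col("Model").sort_by("HumanEval").last().alias("best_HumanEval"),
    pl.col("Model").sort_by("Cost_per_1M").first().alias("cheapest"),
)
print(best)  # Claude-3.5 Sonnet, Claude-3.5 Sonnet, Mixtral-8x7B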

Cost vs context window

cmap="managua" (built-in or cmcrameri)

(df.style()
   .gradient_divergent("Cost_per_1M", center=15.0, cmap="managua")
   .bar("Context_k", fill="lightgreen")
   .fashion_zebra()
   .title("Cost vs Context Window")
   .show()
 )
Cost vs Context Window
Model              MMLU  HumanEval  GPQA  Context_k  Cost_per_1M
GPT-4              86.4       67.0  50.1        128         30.0
Claude-3.5 Sonnet  88.7       92.0  59.4        200          3.0
Gemini-1.5 Pro     85.9       71.9  46.2       2048          7.0
Llama-3-70B        82.0       81.7  42.7          8          0.9
Mixtral-8x7B       70.6       40.2  34.3         32          0.7
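
gradient_divergent colours each cost by which side of center=15.0 it falls on and by how far. Conceptually, the shading tracks a signed, normalised deviation like the one below (a sketch of the idea in plain polars, not polarise's internal formula):

# Signed deviation of Cost_per_1M from the chosen centre (15.0),
# scaled to [-1, 1]; positive values sit on one side of "managua",
# negative values on the other.
centered = df.select(
    pl.col("Model"),
    ((pl.col("Cost_per_1M") - 15.0)
     / (pl.col("Cost_per_1M") - 15.0).abs().max()).alias("centered_cost"),
)
print(centered)  # only GPT-4 (30.0) lands above the 15.0 centre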