LLM Benchmarks
Performance metrics for leading large language models across standardised benchmarks.
import polars as pl
import polarise
from polarise.datasets import get_llm_benchmarks
df = get_llm_benchmarks()
Multi-column gradient
{ cmap="imola" · cmcrameri }
(df.style()
.gradient(["MMLU", "HumanEval", "GPQA"], cmap="imola")
.fashion_minimal()
.title("LLM Benchmark Scores")
.show()
)
LLM Benchmark Scores
| Model | MMLU | HumanEval | GPQA | Context_k | Cost_per_1M |
|---|---|---|---|---|---|
| GPT-4 | 86.4 | 67.0 | 50.1 | 128 | 30.0 |
| Claude-3.5 Sonnet | 88.7 | 92.0 | 59.4 | 200 | 3.0 |
| Gemini-1.5 Pro | 85.9 | 71.9 | 46.2 | 2048 | 7.0 |
| Llama-3-70B | 82.0 | 81.7 | 42.7 | 8 | 0.9 |
| Mixtral-8x7B | 70.6 | 40.2 | 34.3 | 32 | 0.7 |
Highlight best in class
(df.style()
.highlight_max("MMLU")
.highlight_max("HumanEval", fill="lightblue")
.highlight_min("Cost_per_1M", fill="lightgreen")
.fashion_grid()
.title("Best in Class")
.footnote("Cost per 1M tokens. Lower is better.")
.show()
)
Best in Class
| Model | MMLU | HumanEval | GPQA | Context_k | Cost_per_1M |
|---|---|---|---|---|---|
| GPT-4 | 86.4 | 67.0 | 50.1 | 128 | 30.0 |
| Claude-3.5 Sonnet | 88.7 | 92.0 | 59.4 | 200 | 3.0 |
| Gemini-1.5 Pro | 85.9 | 71.9 | 46.2 | 2048 | 7.0 |
| Llama-3-70B | 82.0 | 81.7 | 42.7 | 8 | 0.9 |
| Mixtral-8x7B | 70.6 | 40.2 | 34.3 | 32 | 0.7 |
Cost per 1M tokens. Lower is better.
Cost vs context window
{ cmap="managua" · built-in or cmcrameri }
(df.style()
.gradient_divergent("Cost_per_1M", center=15.0, cmap="managua")
.bar("Context_k", fill="lightgreen")
.fashion_zebra()
.title("Cost vs Context Window")
.show()
)
Cost vs Context Window
| Model | MMLU | HumanEval | GPQA | Context_k | Cost_per_1M |
|---|---|---|---|---|---|
| GPT-4 | 86.4 | 67.0 | 50.1 | 128 | 30.0 |
| Claude-3.5 Sonnet | 88.7 | 92.0 | 59.4 | 200 | 3.0 |
| Gemini-1.5 Pro | 85.9 | 71.9 | 46.2 | 2048 | 7.0 |
| Llama-3-70B | 82.0 | 81.7 | 42.7 | 8 | 0.9 |
| Mixtral-8x7B | 70.6 | 40.2 | 34.3 | 32 | 0.7 |