
LLM Benchmarks

Performance metrics for leading large language models across standardised benchmarks.

import polars as pl
import polarise  # provides the .style() API used in the examples below
from polarise.datasets import get_llm_benchmarks

df = get_llm_benchmarks()  # sample benchmark table bundled with polarise
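
If get_llm_benchmarks is not available in your install, an equivalent frame is easy to build by hand with plain polars; the values below are copied from the rendered tables on this page:

# Hand-built stand-in for get_llm_benchmarks(), using the values
# shown in the rendered tables below.
df = pl.DataFrame(
    {
        "Model": ["GPT-4", "Claude-3.5 Sonnet", "Gemini-1.5 Pro", "Llama-3-70B", "Mixtral-8x7B"],
        "MMLU": [86.4, 88.7, 85.9, 82.0, 70.6],
        "HumanEval": [67.0, 92.0, 71.9, 81.7, 40.2],
        "GPQA": [50.1, 59.4, 46.2, 42.7, 34.3],
        "Context_k": [128, 200, 2048, 8, 32],
        "Cost_per_1M": [30.0, 3.0, 7.0, 0.9, 0.7],
    }
)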

Multi-column gradient

cmap="imola" (cmcrameri)

(df.style()
   .gradient(["MMLU", "HumanEval", "GPQA"], cmap="imola")
   .fashion_minimal()
   .title("LLM Benchmark Scores")
   .show()
 )
LLM Benchmark Scores
Model              MMLU  HumanEval  GPQA  Context_k  Cost_per_1M
GPT-4              86.4       67.0  50.1        128         30.0
Claude-3.5 Sonnet  88.7       92.0  59.4        200          3.0
Gemini-1.5 Pro     85.9       71.9  46.2       2048          7.0
Llama-3-70B        82.0       81.7  42.7          8          0.9
Mixtral-8x7B       70.6       40.2  34.3         32          0.7
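
"imola" is one of Fabio Crameri's Scientific colour maps, which the cmcrameri package exposes as ordinary matplotlib colormaps. A quick way to confirm the palette is available (assuming cmcrameri is installed):

# Sanity check: the palette should import as a matplotlib Colormap.
from cmcrameri import cm

imola = cm.imola
print(type(imola), imola.N)  # Colormap object and its number of colour levels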

Highlight best in class

(df.style()
   .highlight_max("MMLU")
   .highlight_max("HumanEval", fill="lightblue")
   .highlight_min("Cost_per_1M", fill="lightgreen")
   .fashion_grid()
   .title("Best in Class")
   .footnote("Cost per 1M tokens. Lower is better.")
   .show()
 )
Best in Class
Model              MMLU  HumanEval  GPQA  Context_k  Cost_per_1M
GPT-4              86.4       67.0  50.1        128         30.0
Claude-3.5 Sonnet  88.7       92.0  59.4        200          3.0
Gemini-1.5 Pro     85.9       71.9  46.2       2048          7.0
Llama-3-70B        82.0       81.7  42.7          8          0.9
Mixtral-8x7B       70.6       40.2  34.3         32          0.7
Cost per 1M tokens. Lower is better.
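
If you want the winners as data rather than highlighted cells, the same extremes can be pulled out with plain polars (using the df loaded above, assumed to be an ordinary polars DataFrame):

# Which model leads each highlighted benchmark, and which is cheapest.
best = df.select(
    pl.col("Model").sort_by("MMLU").last().alias("best_MMLU"),
    pl.col("Model").sort_by("HumanEval").last().alias("best_HumanEval"),
    pl.col("Model").sort_by("Cost_per_1M").first().alias("cheapest"),
)
print(best)  # Claude-3.5 Sonnet, Claude-3.5 Sonnet, Mixtral-8x7B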

Cost vs context window

cmap="managua" (built-in or cmcrameri)

(df.style()
   .gradient_divergent("Cost_per_1M", center=15.0, cmap="managua")
   .bar("Context_k", fill="lightgreen")
   .fashion_zebra()
   .title("Cost vs Context Window")
   .show()
 )
Cost vs Context Window
Model              MMLU  HumanEval  GPQA  Context_k  Cost_per_1M
GPT-4              86.4       67.0  50.1        128         30.0
Claude-3.5 Sonnet  88.7       92.0  59.4        200          3.0
Gemini-1.5 Pro     85.9       71.9  46.2       2048          7.0
Llama-3-70B        82.0       81.7  42.7          8          0.9
Mixtral-8x7B       70.6       40.2  34.3         32          0.7
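
gradient_divergent colours each cost by which side of center=15.0 it falls on and by how far. Conceptually, the shading tracks a signed, normalised deviation like the one below (a sketch of the idea in plain polars, not polarise's internal formula):

# Signed deviation of Cost_per_1M from the chosen centre (15.0),
# scaled to [-1, 1]; positive values sit on one side of "managua",
# negative values on the other.
centered = df.select(
    pl.col("Model"),
    ((pl.col("Cost_per_1M") - 15.0)
     / (pl.col("Cost_per_1M") - 15.0).abs().max()).alias("centered_cost"),
)
print(centered)  # only GPT-4 (30.0) lands above the 15.0 centre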