Methodology

How KernelArena evaluates AI models on kernel generation.

Overview

Frontier models are getting surprisingly good at writing AI accelerator kernels — but no single model consistently dominates. The performance landscape is the cross product of hardware target, kernel type, precision format, tensor shapes, and more — a space large enough that vibes-based evaluation misses most of it. Proper measurement requires fault-tolerant agent harnesses, access to real hardware, and benchmarking strict enough to catch the many creative ways models reward-hack their way to inflated scores. Individual researchers shouldn't have to shoulder all of that just to pick a model.

KernelArena evaluates how well frontier language models can generate high-performance kernels. Each task specifies exact shapes, dimensions, data types, and operator configurations — the kind of narrow, hardware-specific detail that hand-tuned kernels are written for today. We believe LLMs will soon generate specialized kernels for any given configuration on demand, and KernelArena is built to measure progress toward that future.

We evaluate both the model and the harness it runs in. Each benchmark suite specifies its own harness and hardware target — the details are documented per suite below.

When we discover reward hacks — cases where models exploit the evaluation to inflate scores without genuine performance gains — we document and expose them publicly. Honest benchmarks require honest failure modes.

We're building toward a fully open-source evaluation ecosystem where the community can inspect, swap, and contribute entire harnesses. We plan to add support for additional harnesses, models, and hardware targets so results reflect the broadest possible picture of model capability.

Benchmark Suites

WaferBench NVFP4 B200

NVIDIA B200 · CUDA 12.8
Last edited 2026-03-11

6 fused NVFP4 inference kernels (Add+RMSNorm, SiLU+Mul, Quantize) benchmarked against FlashInfer on NVIDIA B200

Harness — Cursor

Each model is evaluated through Cursor's agentic coding interface (Composer), which gives every model the same workflow: receive the task spec, generate CUDA code, and iterate on compilation and correctness errors. We plan to onboard model-native harnesses (Claude Code, Codex, etc.) soon.

Correctness Validation

Each kernel is validated against FlashInfer's public API (e.g. flashinfer.add_rmsnorm_fp4quant, flashinfer.fp4_quantize) — the production code path on B200, not an internal backend. Correctness is checked via FlashInfer's own pytest suite at the exact benchmark shape and dtype (BF16). All tasks use the NVFP4 format (block_size=16, E4M3 scales). Tolerances are FlashInfer's upstream values, unmodified.
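
The NVFP4 layout referenced above (16-element blocks, each sharing a scale) can be illustrated with a short sketch. This is an illustration of the block-scaled format only, not FlashInfer's implementation: function names are ours, the scale is kept as a plain float (the real format rounds it to FP8 E4M3), and values are not bit-packed.

```python
# Representable magnitudes of FP4 E2M1 (sign stored separately); the max is 6.0.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block_nvfp4(block):
    """Quantize one 16-element block: a shared per-block scale plus FP4 values.

    Illustration only — real NVFP4 stores the scale in E4M3 (rounding not
    modeled here) and packs two 4-bit values per byte.
    """
    assert len(block) == 16  # block_size = 16
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the block's range onto E2M1's max
    quantized = []
    for x in block:
        mag = min(abs(x) / scale, 6.0)
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))  # round to nearest FP4 code
        quantized.append(q if x >= 0 else -q)
    return scale, quantized

def dequantize_block(scale, quantized):
    return [scale * q for q in quantized]
```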

Scoring

score = (1 × correctness) + speedup over reference; only correct kernels contribute speedup. Mean and median speedups are computed across all correct kernels. Pass@1 measures the fraction of kernels correct on the first attempt.
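
The scoring scheme above can be sketched in a few lines. The function names (score_run, summarize) are ours, not the harness's, and the aggregation shown is the straightforward reading of the formula:

```python
from statistics import mean, median

def score_run(correct: bool, speedup: float) -> float:
    # score = (1 x correctness) + speedup over reference;
    # an incorrect kernel scores 0 and contributes no speedup
    return (1.0 + speedup) if correct else 0.0

def summarize(results):
    # results: one (correct, speedup) pair per kernel, first attempt only
    speedups = [s for ok, s in results if ok]
    return {
        "mean_speedup": mean(speedups) if speedups else 0.0,
        "median_speedup": median(speedups) if speedups else 0.0,
        "pass_at_1": sum(ok for ok, _ in results) / len(results),
    }
```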

Benchmarking

Every task is fully pinned: CI/CD runs an identical configuration each time on 8× NVIDIA B200. Tasks are intentionally narrow and hardware-specific (e.g. a single task might fix BLOCK_SIZE = 16, eps = 1e-6, scale_format = "e4m3", and compile with nvcc -O3 -use_fast_math).

Timing follows the ThunderKittens 2.0 convention: 500 warmup iterations, 100 timed repetitions via CUDA events, and L2 cache input cycling. Expect ±5–10% run-to-run variance from GPU thermal and power states.
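
A minimal sketch of this timing convention, assuming a torch-based harness: CUDA events are used when a GPU is present, with a wall-clock fallback so the sketch stays runnable anywhere. The function name (bench) is ours, and L2 input cycling is noted but omitted for brevity.

```python
import time

try:
    import torch
    _CUDA = torch.cuda.is_available()
except ImportError:  # keep the sketch runnable without PyTorch
    _CUDA = False

def bench(fn, warmup=500, reps=100):
    """Time fn() with the warmup + timed-reps convention described above.

    Uses CUDA events on GPU, a wall-clock timer otherwise. L2 cache input
    cycling (rotating input buffers between reps so timings aren't
    flattered by cache hits) is omitted here.
    """
    for _ in range(warmup):
        fn()
    if _CUDA:
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(reps):
            fn()
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / reps  # milliseconds per call
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - t0) * 1e3 / reps  # milliseconds per call
```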

Artifacts

All submitted kernels, harness code, and benchmarking scripts are published in the waferbench-nvfp4-b200 directory on GitHub. Each task is at {task}/{model}/cursor/, with CUDA solutions in results/run-1/kernel.cu and harness code in harness-code/.

KernelBench HIP MI300X

AMD MI300X · ROCm 7.0
Last edited 2026-03-11

KernelBench kernels on AMD MI300X — LLM-generated HIP optimizations

Harness — Basic Agent Loop

Each model is evaluated through a basic agent loop with bash and write tool access only — no IDE integration or advanced agentic scaffolding. The model receives the task spec, generates HIP/ROCm code, and iterates on compilation and correctness errors using shell feedback. Models are allowed to use existing libraries (e.g. composable_kernel, Triton, hipBLASLt) as part of their solutions. This minimal harness isolates the model's own kernel-writing ability from any tooling advantages.
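
The basic agent loop above can be sketched as follows. This is our own illustrative skeleton, not the harness's actual code: model_step stands in for an LLM call, and the action schema (dicts with "tool"/"done" keys) is an assumption.

```python
import subprocess

def tool_bash(cmd: str) -> str:
    # Run a shell command; stdout+stderr is the model's only feedback channel.
    r = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=120)
    return r.stdout + r.stderr

def tool_write(path: str, content: str) -> str:
    with open(path, "w") as f:
        f.write(content)
    return f"wrote {len(content)} bytes to {path}"

def agent_loop(model_step, task_spec, max_turns=20):
    """Drive a model with only bash and write tools until it signals done.

    `model_step` is a placeholder for an LLM call: it sees the transcript
    and returns {"tool": "bash", "cmd": ...}, {"tool": "write", "path": ...,
    "content": ...}, or {"done": True}. No IDE, no extra scaffolding.
    """
    transcript = [("task", task_spec)]
    for _ in range(max_turns):
        action = model_step(transcript)
        if action.get("done"):
            break
        if action["tool"] == "bash":
            out = tool_bash(action["cmd"])
        elif action["tool"] == "write":
            out = tool_write(action["path"], action["content"])
        transcript.append((action["tool"], out))
    return transcript
```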

Correctness Validation

Kernels are validated against reference PyTorch implementations using torch.allclose with rtol=1e-3 and atol=1e-3. A solution must produce correct output to be scored. We apply reward-hack defenses informed by DeepReinforce's work on evaluation hacking in automatic kernel generation, including defenses against stream injection, thread injection, lazy evaluation, and timing patching.
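
For clarity, the torch.allclose criterion used above is the elementwise test |a − b| ≤ atol + rtol·|b|, where b is the reference output. A pure-Python restatement of that criterion (our own sketch, with scalars rather than tensors):

```python
def allclose(actual, reference, rtol=1e-3, atol=1e-3):
    """Mirror torch.allclose's elementwise test: |a - b| <= atol + rtol * |b|.

    `reference` plays the role of torch.allclose's second argument (the
    PyTorch reference output); every element must pass for the kernel to
    count as correct.
    """
    return all(
        abs(a - b) <= atol + rtol * abs(b)
        for a, b in zip(actual, reference)
    )
```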

Scoring

score = (1 × correctness) + speedup over reference. Only correct kernels contribute speedup. Mean and median speedups are computed across all correct kernels. Pass@1 measures the fraction of kernels correct on the first attempt.

Benchmarking

We evaluate 41 problems across 4 difficulty levels on AMD MI300X with ROCm 7.0. Problems are from KernelBench-v3, inspired by the original KernelBench. Levels 1–3 cover single-kernel operators, simple fusion patterns, and full model architectures. Level 4 targets frontier architecture components: DeepSeek MLA, Grouped Query Attention, MoE gating, quantized GEMMs, and GatedDeltaNet. Timing uses warmup iterations followed by timed repetitions measured with HIP events.

Artifacts

All submitted solutions and reference implementations are published in the kernelbench-hip-mi300x directory on GitHub. Solutions are Python .py files at solutions/{task}/{model}/kernel.py. The suite covers 11 models from Anthropic, OpenAI, Google, xAI, Moonshot, and Z.AI across 41 kernels and 4 difficulty levels.