DAX LLM Benchmark
Which LLM is best at DAX?
DAXBench tests how models understand, write, and reason about DAX and Power BI.
Methodology designed by Maxim Anatsko.
Last updated: Apr 2, 2026
110 models · 30 tasks · Initial Release
Model Leaderboard
Ranked by score; tags in parentheses note the reasoning-effort setting (high, med) or free tier where applicable.
| Rank | Model | Provider | Score |
|---|---|---|---|
| 1 | Gemini 3.1 Flash Lite Preview (high) | Google | 97.4% |
| 2 | GPT-5.3 Chat | OpenAI | 96.2% |
| 3 | GLM 5 | Z.AI | 96.2% |
| 4 | GPT-5.4 Mini | OpenAI | 96.2% |
| 5 | Gemma 4 31B | Google | 94.5% |
| 6 | Gemini 3.1 Pro Preview (high) | Google | 93.8% |
| 7 | Qwen3.6 Plus Preview (free) | Qwen | 93.3% |
| 8 | Qwen3.5-Flash (med) | Qwen | 93.2% |
| 9 | GLM 5V Turbo | Z.AI | 91.5% |
| 10 | Qwen3.5 397B A17B | Qwen | 90.3% |
| 11 | Qwen3.5 Plus 2026-02-15 (med) | Qwen | 89.7% |
| 12 | Qwen3.6 Plus (free) | Qwen | 89.0% |
| 13 | GPT-5.3-Codex (high) | OpenAI | 88.6% |
| 14 | gpt-oss-120b | OpenAI | 85.6% |
| 15 | GPT-5.1-Codex-Max | OpenAI | 85.0% |
| 16 | KAT-Coder-Pro V2 | Kwaipilot | 84.8% |
| 17 | Claude Sonnet 4.6 (med) | Anthropic | 84.5% |
| 18 | GLM 5 Turbo | Z.AI | 84.3% |
| 19 | Gemini 3 Flash Preview | Google | 83.8% |
| 20 | Gemini 2.5 Flash Preview 09-2025 | Google | 83.8% |
| 21 | GPT-5.4 (high) | OpenAI | 83.2% |
| 22 | Claude Opus 4.5 | Anthropic | 82.7% |
| 23 | Claude Opus 4.6 | Anthropic | 82.0% |
| 24 | R1 | DeepSeek | 81.3% |
| 25 | Claude Sonnet 4 | Anthropic | 81.3% |
| 26 | Gemini 3 Pro Preview | Google | 81.3% |
| 27 | o3 | OpenAI | 80.2% |
| 28 | Grok 4.20 Beta (high) | xAI | 80.1% |
| 29 | GPT-5.2 Chat | OpenAI | 80.1% |
| 30 | DeepSeek V3.2 | DeepSeek | 79.9% |
| 31 | GPT-5.2 | OpenAI | 78.4% |
| 32 | Kimi K2 Thinking | Moonshot AI | 78.4% |
| 33 | Aurora Alpha | Openrouter | 78.2% |
| 34 | Claude Sonnet 4.5 | Anthropic | 77.9% |
| 35 | Grok 4.20 Multi-Agent Beta (high) | xAI | 77.9% |
| 36 | Hunter Alpha | Openrouter | 77.8% |
| 37 | Grok 4 | xAI | 76.7% |
| 38 | Gemini 2.0 Flash | Google | 76.6% |
| 39 | Gemini 2.0 Flash Experimental (free) | Google | 76.6% |
| 40 | o4 Mini | OpenAI | 76.5% |
| 41 | Gemini 2.5 Flash | Google | 76.0% |
| 42 | DeepSeek V3.1 | DeepSeek | 75.9% |
| 43 | MiMo-V2-Omni | Xiaomi | 74.7% |
| 44 | DeepSeek V3.2 Speciale | DeepSeek | 74.7% |
| 45 | MiMo-V2-Pro | Xiaomi | 74.7% |
| 46 | Llama 4 Maverick | Meta | 74.4% |
| 47 | Nemotron 3 Super (free) | Nvidia | 74.2% |
| 48 | GPT-4o-mini (2024-07-18) | OpenAI | 74.1% |
| 49 | R1 0528 | DeepSeek | 73.7% |
| 50 | Grok Code Fast 1 | xAI | 73.7% |
About This Benchmark
Evaluation Method
Models are tested against DAX tasks of varying complexity using the Contoso sample dataset. Responses are evaluated for syntax correctness and output accuracy.
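For illustration, a basic-tier task might ask for a simple aggregation measure. The sketch below is a hypothetical example of the kind of answer being graded, not an actual benchmark task; the Sales table and its Quantity and Net Price columns are assumed to follow the usual Contoso schema.

```dax
-- Hypothetical basic task: total sales amount.
-- Assumes Contoso-style Sales[Quantity] and Sales[Net Price] columns.
Total Sales Amount =
SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )
```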
Scoring System
Harder tasks are worth more points. Correct solutions also earn bonus points for following DAX best practices, writing efficient code, and producing clear, readable output.
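As a rough illustration of what those bonus criteria reward, the sketch below writes a year-over-year calculation with variables and DIVIDE, the kind of readable, defensive pattern generally treated as DAX best practice. The measure and the 'Date' table are assumptions carried over from the example above, not benchmark content.

```dax
-- Illustrative only: variables avoid evaluating the base measure twice,
-- and DIVIDE guards against division by zero.
Sales YoY % =
VAR CurrentSales = [Total Sales Amount]
VAR PriorSales =
    CALCULATE ( [Total Sales Amount], DATEADD ( 'Date'[Date], -1, YEAR ) )
RETURN
    DIVIDE ( CurrentSales - PriorSales, PriorSales )
```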
Task Categories
Tasks cover aggregation, time intelligence, filtering, calculations, iterators, and context transitions across basic, intermediate, and advanced levels.
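To give a sense of the advanced end of that range, the hypothetical sketch below combines an iterator with context transition; the Customer table and the [Total Sales Amount] measure are assumptions from the earlier example, not taken from the benchmark itself.

```dax
-- Hypothetical advanced task: average revenue per customer.
-- Referencing a measure inside AVERAGEX triggers context transition,
-- so [Total Sales Amount] is evaluated once per customer key.
Avg Sales per Customer =
AVERAGEX ( VALUES ( Customer[CustomerKey] ), [Total Sales Amount] )
```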