DAX LLM Benchmark

Which LLM is best at DAX?
DAXBench tests how models understand, write, and reason about DAX and Power BI.
Methodology designed by Maxim Anatsko.

Last updated: Jun 10, 2026

127 models · 30 tasks · Initial Release

Model Leaderboard

Ranked by score

Rank · Model · Provider · Score
1 · Gemini 3.1 Flash Lite Preview [HIGH] · Google · 97.4%
2 · Claude Fable 5 · Anthropic · 96.9%
3 · GPT-5.3 Chat · OpenAI · 96.9%
4 · Qwen3.5 Plus 2026-02-15 [MED] · Qwen · 96.8%
5 · GLM 5 · Z.AI · 96.2%
6 · Qwen3.7 Max · Qwen · 94.5%
7 · Gemini 3.1 Pro Preview [HIGH] · Google · 94.5%
8 · Gemma 4 31B · Google · 94.5%
9 · Qwen3.6 Plus Preview (free) · Qwen · 93.9%
10 · Qwen3.5 397B A17B · Qwen · 93.9%
11 · GPT-5.4 Mini · OpenAI · 93.3%
12 · GPT-5.5 · OpenAI · 92.1%
13 · Qwen3.6 Max Preview · Qwen · 91.5%
14 · Qwen3.5-Flash [MED] · Qwen · 90.8%
15 · GLM 5.1 · Z.AI · 90.3%
16 · Qwen3.6 Plus (free) · Qwen · 89.7%
17 · GLM 5V Turbo · Z.AI · 89.1%
18 · GPT-5.3-Codex [HIGH] · OpenAI · 88.6%
19 · gpt-oss-120b · OpenAI · 88.0%
20 · Claude Sonnet 4.6 [MED] · Anthropic · 87.4%
21 · Claude Sonnet 4 · Anthropic · 87.4%
22 · KAT-Coder-Pro V2 · Kwaipilot · 87.2%
23 · GLM 5 Turbo · Z.AI · 86.7%
24 · Gemini 2.5 Flash Preview 09-2025 · Google · 86.2%
25 · GPT-5.1-Codex-Max · OpenAI · 85.6%
26 · Claude Opus 4.8 · Anthropic · 85.4%
27 · Gemini 3 Pro Preview · Google · 84.9%
28 · Claude Sonnet 4.5 · Anthropic · 84.4%
29 · o3 · OpenAI · 84.4%
30 · Gemini 3.1 Flash Lite · Google · 84.4%
31 · Kimi K2 Thinking · Moonshot AI · 84.4%
32 · GPT-5.4 [HIGH] · OpenAI · 83.8%
33 · Grok 4.3 · xAI · 83.7%
34 · Grok 4.20 Beta [HIGH] · xAI · 83.0%
35 · Claude Opus 4.5 · Anthropic · 82.7%
36 · Grok 4 · xAI · 82.7%
37 · Claude Opus 4.6 · Anthropic · 82.0%
38 · Gemini 3 Flash Preview · Google · 81.4%
39 · GPT-5.2 · OpenAI · 81.4%
40 · R1 · DeepSeek · 81.3%
41 · Grok Build 0.1 · xAI · 81.3%
42 · DeepSeek V4 Pro · DeepSeek · 80.8%
43 · Aurora Alpha · Openrouter · 80.6%
44 · GPT-5.2 Chat · OpenAI · 80.1%
45 · Qwen3 Max Thinking · Qwen · 80.1%
46 · Gemini 2.5 Flash · Google · 79.6%
47 · Kimi K2.6 · Moonshot AI · 78.7%
48 · DeepSeek V3.2 Speciale · DeepSeek · 78.3%
49 · DeepSeek V3.1 · DeepSeek · 78.3%
50 · Grok 4.20 Multi-Agent Beta [HIGH] · xAI · 77.9%

About This Benchmark

Evaluation Method

Models are tested against DAX tasks of varying complexity using the Contoso sample dataset. Responses are evaluated for syntax correctness and output accuracy.
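
As a rough illustration of the output-accuracy half of that check, the sketch below compares a model's query result against an expected result table. It is not the benchmark's actual checker, which this page does not describe. The function name `rows_match`, the tolerance value, and the assumption that both tables arrive as ordered lists of row tuples are all hypothetical.

```python
import math


def rows_match(expected, actual, rel_tol=1e-9):
    """Return True if two result tables hold the same rows in the same order.

    Hypothetical helper: assumes each table is a list of row tuples and that
    both queries impose the same row ordering.
    """
    if len(expected) != len(actual):
        return False
    for exp_row, act_row in zip(expected, actual):
        if len(exp_row) != len(act_row):
            return False
        for e, a in zip(exp_row, act_row):
            both_numeric = isinstance(e, (int, float)) and isinstance(a, (int, float))
            if both_numeric:
                # Tolerate tiny floating-point differences in aggregated values.
                if not math.isclose(e, a, rel_tol=rel_tol):
                    return False
            elif e != a:
                return False
    return True
```

A numeric tolerance is used here because two equivalent measures can differ in the last few digits depending on the order of floating-point operations; whether the real evaluation allows such differences is an open assumption.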

Scoring System

Harder tasks are worth more points. Correct solutions also earn bonus points for following DAX best practices, writing efficient code, and producing clear, readable output.
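
The scoring description above can be sketched as a weighted sum with quality bonuses. The point values, the flat per-criterion bonus, and the names `BASE_POINTS`, `BONUS_PER_CRITERION`, `TaskResult`, and `score_percent` are all assumptions made for illustration; the actual weights are not given on this page. The sketch also assumes the leaderboard percentage is points earned divided by points possible.

```python
from dataclasses import dataclass

# Hypothetical base points per difficulty level (harder tasks are worth more).
BASE_POINTS = {"basic": 1.0, "intermediate": 2.0, "advanced": 3.0}
# Hypothetical bonus awarded per quality criterion met by a correct solution.
BONUS_PER_CRITERION = 0.25


@dataclass
class TaskResult:
    level: str                    # "basic", "intermediate", or "advanced"
    correct: bool                 # syntax valid and output accurate
    best_practices: bool = False  # follows DAX best practices
    efficient: bool = False       # efficient code
    readable: bool = False        # clear, readable output


def task_points(result):
    """Points earned for one task; bonuses apply only to correct solutions."""
    if not result.correct:
        return 0.0
    criteria_met = sum([result.best_practices, result.efficient, result.readable])
    return BASE_POINTS[result.level] + BONUS_PER_CRITERION * criteria_met


def max_points(level):
    """Maximum points available for a task at the given level."""
    return BASE_POINTS[level] + 3 * BONUS_PER_CRITERION


def score_percent(results):
    """Overall score as a percentage of the points available."""
    possible = sum(max_points(r.level) for r in results)
    if possible == 0:
        return 0.0
    earned = sum(task_points(r) for r in results)
    return 100.0 * earned / possible
```

Under these assumed weights, a model that solves a basic task perfectly but fails an advanced one earns 1.75 of 5.5 available points, so missing hard tasks costs more than missing easy ones.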

Task Categories

Tasks cover aggregation, time intelligence, filtering, calculations, iterators, and context transitions across basic, intermediate, and advanced levels.
