DAX LLM Benchmark

Which LLM is best at DAX?
DAXBench tests how models understand, write, and reason about DAX and Power BI.
Methodology designed by Maxim Anatsko.

Last updated: Apr 2, 2026

110 models · 30 tasks · Initial Release

Model Leaderboard

Ranked by score

Rank  Model                                 Provider     Score
1     Gemini 3.1 Flash Lite Preview (HIGH)  Google       97.4%
2     GPT-5.3 Chat                          OpenAI       96.2%
3     GLM 5                                 Z.AI         96.2%
4     GPT-5.4 Mini                          OpenAI       96.2%
5     Gemma 4 31B                           Google       94.5%
6     Gemini 3.1 Pro Preview (HIGH)         Google       93.8%
7     Qwen3.6 Plus Preview (free)           Qwen         93.3%
8     Qwen3.5-Flash (MED)                   Qwen         93.2%
9     GLM 5V Turbo                          Z.AI         91.5%
10    Qwen3.5 397B A17B                     Qwen         90.3%
11    Qwen3.5 Plus 2026-02-15 (MED)         Qwen         89.7%
12    Qwen3.6 Plus (free)                   Qwen         89.0%
13    GPT-5.3-Codex (HIGH)                  OpenAI       88.6%
14    gpt-oss-120b                          OpenAI       85.6%
15    GPT-5.1-Codex-Max                     OpenAI       85.0%
16    KAT-Coder-Pro V2                      Kwaipilot    84.8%
17    Claude Sonnet 4.6 (MED)               Anthropic    84.5%
18    GLM 5 Turbo                           Z.AI         84.3%
19    Gemini 3 Flash Preview                Google       83.8%
20    Gemini 2.5 Flash Preview 09-2025      Google       83.8%
21    GPT-5.4 (HIGH)                        OpenAI       83.2%
22    Claude Opus 4.5                       Anthropic    82.7%
23    Claude Opus 4.6                       Anthropic    82.0%
24    R1                                    DeepSeek     81.3%
25    Claude Sonnet 4                       Anthropic    81.3%
26    Gemini 3 Pro Preview                  Google       81.3%
27    o3                                    OpenAI       80.2%
28    Grok 4.20 Beta (HIGH)                 xAI          80.1%
29    GPT-5.2 Chat                          OpenAI       80.1%
30    DeepSeek V3.2                         DeepSeek     79.9%
31    GPT-5.2                               OpenAI       78.4%
32    Kimi K2 Thinking                      Moonshot AI  78.4%
33    Aurora Alpha                          Openrouter   78.2%
34    Claude Sonnet 4.5                     Anthropic    77.9%
35    Grok 4.20 Multi-Agent Beta (HIGH)     xAI          77.9%
36    Hunter Alpha                          Openrouter   77.8%
37    Grok 4                                xAI          76.7%
38    Gemini 2.0 Flash                      Google       76.6%
39    Gemini 2.0 Flash Experimental (free)  Google       76.6%
40    o4 Mini                               OpenAI       76.5%
41    Gemini 2.5 Flash                      Google       76.0%
42    DeepSeek V3.1                         DeepSeek     75.9%
43    MiMo-V2-Omni                          Xiaomi       74.7%
44    DeepSeek V3.2 Speciale                DeepSeek     74.7%
45    MiMo-V2-Pro                           Xiaomi       74.7%
46    Llama 4 Maverick                      Meta         74.4%
47    Nemotron 3 Super (free)               Nvidia       74.2%
48    GPT-4o-mini (2024-07-18)              OpenAI       74.1%
49    R1 0528                               DeepSeek     73.7%
50    Grok Code Fast 1                      xAI          73.7%

About This Benchmark

Evaluation Method

Models are tested against DAX tasks of varying complexity using the Contoso sample dataset. Responses are evaluated for syntax correctness and output accuracy.
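
To make "output accuracy" concrete, here is a minimal Python sketch of one way a model's query result could be compared against a reference result. It is an illustration under stated assumptions, not the benchmark's actual checker: it assumes a separate DAX engine has already run both queries and returned rows as tuples, and the names `cells_match` and `rows_match`, the tolerance values, and the order-insensitive comparison are all hypothetical choices.

```python
import math
from numbers import Real


def cells_match(expected, actual, rel_tol=1e-6):
    """Compare two cells; numbers use a tolerance, everything else uses ==."""
    if isinstance(expected, Real) and isinstance(actual, Real):
        return math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=1e-9)
    return expected == actual


def rows_match(expected_rows, actual_rows, rel_tol=1e-6):
    """Return True if both result sets hold the same rows, in any order.

    Assumption: row order is not part of the task, so each expected row is
    paired with the first unmatched actual row that agrees cell by cell.
    """
    if len(expected_rows) != len(actual_rows):
        return False
    remaining = list(actual_rows)
    for exp_row in expected_rows:
        for i, act_row in enumerate(remaining):
            same_width = len(exp_row) == len(act_row)
            if same_width and all(
                cells_match(e, a, rel_tol) for e, a in zip(exp_row, act_row)
            ):
                del remaining[i]  # each actual row may be used only once
                break
        else:
            return False  # no actual row matched this expected row
    return True


# Hypothetical reference and model outputs for a "sales by year" task.
reference = [("2023", 100.0), ("2024", 250.5)]
model_output = [("2024", 250.5000000001), ("2023", 100)]
is_accurate = rows_match(reference, model_output)
```

The tolerance is there because floating-point aggregation can differ slightly between two otherwise equivalent expressions; a stricter or looser threshold would be an equally defensible choice.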

Scoring System

Harder tasks are worth more points. Correct solutions also earn bonus points for following DAX best practices, writing efficient code, and producing clear, readable output.
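
As a rough illustration of how difficulty-weighted points plus bonuses could roll up into a percentage like those on the leaderboard, consider the Python sketch below. The point values, bonus sizes, and every name in it (`TaskResult`, `DIFFICULTY_POINTS`, `BONUS_POINTS`, `score_percent`) are assumptions made for the example; the real weights are not stated here.

```python
from dataclasses import dataclass

# Hypothetical weights: assumptions for illustration, not the real values.
DIFFICULTY_POINTS = {"basic": 1.0, "intermediate": 2.0, "advanced": 3.0}
BONUS_POINTS = {"best_practices": 0.25, "efficiency": 0.25, "readability": 0.25}


@dataclass
class TaskResult:
    difficulty: str                   # "basic", "intermediate", or "advanced"
    correct: bool                     # did the solution produce the right output?
    bonuses: frozenset = frozenset()  # names of bonuses earned, if correct


def max_points(difficulty):
    """Highest score available for one task: base points plus every bonus."""
    return DIFFICULTY_POINTS[difficulty] + sum(BONUS_POINTS.values())


def earned_points(result):
    """Bonuses count only for correct solutions, as the text describes."""
    if not result.correct:
        return 0.0
    bonus = sum(BONUS_POINTS[name] for name in result.bonuses)
    return DIFFICULTY_POINTS[result.difficulty] + bonus


def score_percent(results):
    """Earned points as a percentage of the maximum, rounded to one decimal."""
    total = sum(max_points(r.difficulty) for r in results)
    earned = sum(earned_points(r) for r in results)
    return round(100.0 * earned / total, 1) if total else 0.0


# One fully credited basic task and one failed advanced task.
example = [
    TaskResult("basic", True, frozenset(BONUS_POINTS)),
    TaskResult("advanced", False),
]
example_score = score_percent(example)  # 1.75 of 5.5 points -> 31.8
```

Under this scheme, missing one advanced task costs more than missing one basic task, which is the behavior the weighting is meant to capture.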

Task Categories

Tasks cover aggregation, time intelligence, filtering, calculations, iterators, and context transitions across basic, intermediate, and advanced levels.