DAX LLM Benchmark
Which LLM is best at DAX?
DAXBench tests how well models understand, write, and reason about DAX and Power BI.
Methodology designed by Maxim Anatsko.
Last updated: Jun 10, 2026
127 models · 30 tasks · Initial Release
Model Leaderboard
Ranked by score
| Rank | Model | Provider | Score |
|---|---|---|---|
| 1 | Gemini 3.1 Flash Lite Preview (HIGH) | Google | 97.4% |
| 2 | Claude Fable 5 | Anthropic | 96.9% |
| 3 | GPT-5.3 Chat | OpenAI | 96.9% |
| 4 | Qwen3.5 Plus 2026-02-15 (MED) | Qwen | 96.8% |
| 5 | GLM 5 | Z.AI | 96.2% |
| 6 | Qwen3.7 Max | Qwen | 94.5% |
| 7 | Gemini 3.1 Pro Preview (HIGH) | Google | 94.5% |
| 8 | Gemma 4 31B | Google | 94.5% |
| 9 | Qwen3.6 Plus Preview (free) | Qwen | 93.9% |
| 10 | Qwen3.5 397B A17B | Qwen | 93.9% |
| 11 | GPT-5.4 Mini | OpenAI | 93.3% |
| 12 | GPT-5.5 | OpenAI | 92.1% |
| 13 | Qwen3.6 Max Preview | Qwen | 91.5% |
| 14 | Qwen3.5-Flash (MED) | Qwen | 90.8% |
| 15 | GLM 5.1 | Z.AI | 90.3% |
| 16 | Qwen3.6 Plus (free) | Qwen | 89.7% |
| 17 | GLM 5V Turbo | Z.AI | 89.1% |
| 18 | GPT-5.3-Codex (HIGH) | OpenAI | 88.6% |
| 19 | gpt-oss-120b | OpenAI | 88.0% |
| 20 | Claude Sonnet 4.6 (MED) | Anthropic | 87.4% |
| 21 | Claude Sonnet 4 | Anthropic | 87.4% |
| 22 | KAT-Coder-Pro V2 | Kwaipilot | 87.2% |
| 23 | GLM 5 Turbo | Z.AI | 86.7% |
| 24 | Gemini 2.5 Flash Preview 09-2025 | Google | 86.2% |
| 25 | GPT-5.1-Codex-Max | OpenAI | 85.6% |
| 26 | Claude Opus 4.8 | Anthropic | 85.4% |
| 27 | Gemini 3 Pro Preview | Google | 84.9% |
| 28 | Claude Sonnet 4.5 | Anthropic | 84.4% |
| 29 | o3 | OpenAI | 84.4% |
| 30 | Gemini 3.1 Flash Lite | Google | 84.4% |
| 31 | Kimi K2 Thinking | Moonshot AI | 84.4% |
| 32 | GPT-5.4 (HIGH) | OpenAI | 83.8% |
| 33 | Grok 4.3 | xAI | 83.7% |
| 34 | Grok 4.20 Beta (HIGH) | xAI | 83.0% |
| 35 | Claude Opus 4.5 | Anthropic | 82.7% |
| 36 | Grok 4 | xAI | 82.7% |
| 37 | Claude Opus 4.6 | Anthropic | 82.0% |
| 38 | Gemini 3 Flash Preview | Google | 81.4% |
| 39 | GPT-5.2 | OpenAI | 81.4% |
| 40 | R1 | DeepSeek | 81.3% |
| 41 | Grok Build 0.1 | xAI | 81.3% |
| 42 | DeepSeek V4 Pro | DeepSeek | 80.8% |
| 43 | Aurora Alpha | Openrouter | 80.6% |
| 44 | GPT-5.2 Chat | OpenAI | 80.1% |
| 45 | Qwen3 Max Thinking | Qwen | 80.1% |
| 46 | Gemini 2.5 Flash | Google | 79.6% |
| 47 | Kimi K2.6 | Moonshot AI | 78.7% |
| 48 | DeepSeek V3.2 Speciale | DeepSeek | 78.3% |
| 49 | DeepSeek V3.1 | DeepSeek | 78.3% |
| 50 | Grok 4.20 Multi-Agent Beta (HIGH) | xAI | 77.9% |
About This Benchmark
Evaluation Method
Models are tested against DAX tasks of varying complexity using the Contoso sample dataset. Responses are evaluated for syntax correctness and output accuracy.
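To make the output-accuracy half of that check concrete, here is a minimal Python sketch of how a model's query result might be compared with a reference result. It is an illustration under stated assumptions, not the benchmark's actual harness: the function names `values_match` and `rows_match`, the row-as-tuple representation, the order-insensitive comparison, and the numeric tolerance are all hypothetical. Syntax checking would need a DAX engine and is not shown.

```python
from math import isclose


def values_match(expected, actual, rel_tol):
    """Compare two cell values, allowing small floating-point differences."""
    if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
        return isclose(expected, actual, rel_tol=rel_tol)
    return expected == actual


def rows_match(expected_rows, actual_rows, rel_tol=1e-6):
    """Return True if both results contain the same rows, ignoring row order.

    Assumption: each result is a list of tuples, one tuple per row.
    """
    if len(expected_rows) != len(actual_rows):
        return False
    remaining = list(actual_rows)  # rows not yet paired with an expected row
    for exp_row in expected_rows:
        for i, act_row in enumerate(remaining):
            same_width = len(exp_row) == len(act_row)
            if same_width and all(
                values_match(e, a, rel_tol) for e, a in zip(exp_row, act_row)
            ):
                del remaining[i]  # each actual row may be matched only once
                break
        else:
            return False  # no actual row matched this expected row
    return True


# Hypothetical example data, not taken from the benchmark.
reference = [("Contoso", 100.0), ("Fabrikam", 250.5)]
candidate = [("Fabrikam", 250.5000000001), ("Contoso", 100.0)]
print(rows_match(reference, candidate))
```

Comparing rows without regard to order is a design choice for this sketch; a task that requires a specific sort order would need a stricter, position-by-position comparison.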
Scoring System
Harder tasks are worth more points. Correct solutions also earn bonus points for following DAX best practices, writing efficient code, and producing clear, readable output.
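One way such a scheme could be computed is sketched below in Python. The point values, the bonus size, the normalization to a percentage, and every name (`DIFFICULTY_POINTS`, `BONUS_PER_CRITERION`, `TaskResult`, `score_percent`) are assumptions made for illustration; the benchmark's real weights and formula are not stated on this page.

```python
from dataclasses import dataclass

# Hypothetical weights: harder tasks are worth more points.
DIFFICULTY_POINTS = {"basic": 1.0, "intermediate": 2.0, "advanced": 3.0}
# Hypothetical bonus awarded for each quality criterion a correct answer meets.
BONUS_PER_CRITERION = 0.25
NUM_CRITERIA = 3  # best practices, efficiency, readability


@dataclass
class TaskResult:
    difficulty: str
    correct: bool
    best_practices: bool = False
    efficient: bool = False
    readable: bool = False


def task_points(result):
    """Points earned on one task; bonuses apply only to correct solutions."""
    if not result.correct:
        return 0.0
    criteria_met = sum([result.best_practices, result.efficient, result.readable])
    return DIFFICULTY_POINTS[result.difficulty] + BONUS_PER_CRITERION * criteria_met


def max_points(difficulty):
    """Maximum points available on a task of the given difficulty."""
    return DIFFICULTY_POINTS[difficulty] + BONUS_PER_CRITERION * NUM_CRITERIA


def score_percent(results):
    """Earned points as a percentage of all available points (assumed normalization)."""
    possible = sum(max_points(r.difficulty) for r in results)
    if possible == 0:
        return 0.0
    earned = sum(task_points(r) for r in results)
    return 100.0 * earned / possible


# Hypothetical run: one fully credited basic task, one failed advanced task.
run = [
    TaskResult("basic", correct=True, best_practices=True, efficient=True, readable=True),
    TaskResult("advanced", correct=False),
]
print(round(score_percent(run), 1))
```

Under these assumed weights, failing a single advanced task costs more than failing a basic one, which is the behavior the description above implies.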
Task Categories
Tasks cover aggregation, time intelligence, filtering, calculations, iterators, and context transitions across basic, intermediate, and advanced levels.