DAX LLM Benchmark
Which LLM is best at DAX?
DAXBench tests how models understand, write, and reason about DAX and Power BI.
Methodology designed by Maxim Anatsko.
Last updated: Apr 2, 2026
110 models · 30 tasks · Initial Release
Model Leaderboard
Ranked by score; tags in parentheses note the reasoning-effort setting (high, med) or free tier where applicable.
| Rank | Model | Provider | Score |
|---|---|---|---|
| 1 | Gemini 3.1 Flash Lite Preview (high) | Google | 97.4% |
| 2 | GPT-5.3 Chat | OpenAI | 96.2% |
| 3 | GLM 5 | Z.AI | 96.2% |
| 4 | GPT-5.4 Mini | OpenAI | 96.2% |
| 5 | Gemma 4 31B | Google | 94.5% |
| 6 | Gemini 3.1 Pro Preview (high) | Google | 93.8% |
| 7 | Qwen3.6 Plus Preview (free) | Qwen | 93.3% |
| 8 | Qwen3.5-Flash (med) | Qwen | 93.2% |
| 9 | GLM 5V Turbo | Z.AI | 91.5% |
| 10 | Qwen3.5 397B A17B | Qwen | 90.3% |
| 11 | Qwen3.5 Plus 2026-02-15 (med) | Qwen | 89.7% |
| 12 | Qwen3.6 Plus (free) | Qwen | 89.0% |
| 13 | GPT-5.3-Codex (high) | OpenAI | 88.6% |
| 14 | gpt-oss-120b | OpenAI | 85.6% |
| 15 | GPT-5.1-Codex-Max | OpenAI | 85.0% |
| 16 | KAT-Coder-Pro V2 | Kwaipilot | 84.8% |
| 17 | Claude Sonnet 4.6 (med) | Anthropic | 84.5% |
| 18 | GLM 5 Turbo | Z.AI | 84.3% |
| 19 | Gemini 3 Flash Preview | Google | 83.8% |
| 20 | Gemini 2.5 Flash Preview 09-2025 | Google | 83.8% |
| 21 | GPT-5.4 (high) | OpenAI | 83.2% |
| 22 | Claude Opus 4.5 | Anthropic | 82.7% |
| 23 | Claude Opus 4.6 | Anthropic | 82.0% |
| 24 | R1 | DeepSeek | 81.3% |
| 25 | Claude Sonnet 4 | Anthropic | 81.3% |
| 26 | Gemini 3 Pro Preview | Google | 81.3% |
| 27 | o3 | OpenAI | 80.2% |
| 28 | Grok 4.20 Beta (high) | xAI | 80.1% |
| 29 | GPT-5.2 Chat | OpenAI | 80.1% |
| 30 | DeepSeek V3.2 | DeepSeek | 79.9% |
| 31 | GPT-5.2 | OpenAI | 78.4% |
| 32 | Kimi K2 Thinking | Moonshot AI | 78.4% |
| 33 | Aurora Alpha | Openrouter | 78.2% |
| 34 | Claude Sonnet 4.5 | Anthropic | 77.9% |
| 35 | Grok 4.20 Multi-Agent Beta (high) | xAI | 77.9% |
| 36 | Hunter Alpha | Openrouter | 77.8% |
| 37 | Grok 4 | xAI | 76.7% |
| 38 | Gemini 2.0 Flash | Google | 76.6% |
| 39 | Gemini 2.0 Flash Experimental (free) | Google | 76.6% |
| 40 | o4 Mini | OpenAI | 76.5% |
| 41 | Gemini 2.5 Flash | Google | 76.0% |
| 42 | DeepSeek V3.1 | DeepSeek | 75.9% |
| 43 | MiMo-V2-Omni | Xiaomi | 74.7% |
| 44 | DeepSeek V3.2 Speciale | DeepSeek | 74.7% |
| 45 | MiMo-V2-Pro | Xiaomi | 74.7% |
| 46 | Llama 4 Maverick | Meta | 74.4% |
| 47 | Nemotron 3 Super (free) | Nvidia | 74.2% |
| 48 | GPT-4o-mini (2024-07-18) | OpenAI | 74.1% |
| 49 | R1 0528 | DeepSeek | 73.7% |
| 50 | Grok Code Fast 1 | xAI | 73.7% |
About This Benchmark
Evaluation Method
Models are tested against DAX tasks of varying complexity using the Contoso sample dataset. Responses are evaluated for syntax correctness and output accuracy.
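For illustration, a basic-tier task might ask for a simple aggregation measure. The sketch below is a hypothetical example of the kind of answer being graded, not an actual benchmark task; the Sales table and its Quantity and Net Price columns are assumed to follow the usual Contoso schema.

```dax
-- Hypothetical basic task: total sales amount.
-- Assumes Contoso-style Sales[Quantity] and Sales[Net Price] columns.
Total Sales Amount =
SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )
```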
Scoring System
Harder tasks are worth more points. Correct solutions also earn bonus points for following DAX best practices, writing efficient code, and producing clear, readable output.
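As a rough illustration of what those bonus criteria reward, the sketch below writes a year-over-year calculation with variables and DIVIDE, the kind of readable, defensive pattern generally treated as DAX best practice. The measure and the 'Date' table are assumptions carried over from the example above, not benchmark content.

```dax
-- Illustrative only: variables avoid evaluating the base measure twice,
-- and DIVIDE guards against division by zero.
Sales YoY % =
VAR CurrentSales = [Total Sales Amount]
VAR PriorSales =
    CALCULATE ( [Total Sales Amount], DATEADD ( 'Date'[Date], -1, YEAR ) )
RETURN
    DIVIDE ( CurrentSales - PriorSales, PriorSales )
```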
Task Categories
Tasks cover aggregation, time intelligence, filtering, calculations, iterators, and context transitions across basic, intermediate, and advanced levels.
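To give a sense of the advanced end of that range, the hypothetical sketch below combines an iterator with context transition; the Customer table and the [Total Sales Amount] measure are assumptions from the earlier example, not taken from the benchmark itself.

```dax
-- Hypothetical advanced task: average revenue per customer.
-- Referencing a measure inside AVERAGEX triggers context transition,
-- so [Total Sales Amount] is evaluated once per customer key.
Avg Sales per Customer =
AVERAGEX ( VALUES ( Customer[CustomerKey] ), [Total Sales Amount] )
```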