Performance
Model (Agentic Strategy) | Avg.Rank | StreetFighterIII | Supermario | AceAttorney | HerStory | Pokémon | DarkestDun | MineCraft | StardewValley | StarCraftII | SlayTheSpire | BabaIsYou | 2048 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
✔️Llama-3.2-1B-it (default) | 17.8 | 0.0 ± 0.0 | 18.7 ± 8.6 | 1.3 ± 2.2 | 2.1 ± 1.2 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 6.7 ± 11.5 | 0.0 ± 0.1 |
✔️Llama-3.2-3B-it (default) | 14.6 | 13.3 ± 5.8 | 31.8 ± 10.1 | 4.6 ± 1.3 | 4.2 ± 1.1 | 0.0 ± 0.0 | 47.5 ± 39.2 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 20.0 ± 0.0 | 0.3 ± 0.2 |
✔️Qwen-2.5-3B-it (default) | 15.8 | 20.0 ± 0.0 | 23.4 ± 14.1 | 20.0 ± 17.4 | 1.2 ± 1.1 | 0.0 ± 0.0 | 44.8 ± 22.2 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 13.3 ± 11.5 | 0.1 ± 0.1 |
✔️Qwen-2.5-7B-it (default) | 12.9 | 16.7 ± 11.5 | 27.2 ± 9.6 | 9.3 ± 0.2 | 8.5 ± 1.9 | 0.0 ± 0.0 | 88.8 ± 2.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 5.0 ± 0.0 | 20.0 ± 0.0 | 0.6 ± 0.4 |
✔️Minitron-4B-it (default) | 14.9 | 16.7 ± 11.5 | 24.4 ± 6.0 | 35.7 ± 4.5 | 4.5 ± 2.2 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 20.0 ± 0.0 | 0.1 ± 0.0 |
✔️Minitron-8B-it (default) | 12.5 | 23.3 ± 5.8 | 31.3 ± 12.8 | 29.9 ± 3.6 | 8.2 ± 1.8 | 0.0 ± 0.0 | 63.8 ± 30.4 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 20.0 ± 0.0 | 0.7 ± 0.7 |
✔️GPT-4o-mini-2025-04-16 (default) | 11.2 | 16.7 ± 11.5 | 28.8 ± 8.8 | 28.4 ± 2.8 | 21.1 ± 5.5 | 0.0 ± 0.0 | 81.3 ± 5.8 | 46.0 ± 7.0 | 18.9 ± 32.7 | 75.0 ± 50.0 | 3.3 ± 2.9 | 13.3 ± 11.5 | 1.1 ± 1.0 |
🥈✔️GPT-4o-2024-08-06 (default) | 4.3 | 29.7 ± 14.3 | 34.1 ± 14.2 | 85.3 ± 1.5 | 64.0 ± 5.1 | 38.9 ± 9.6 | 93.4 ± 1.5 | 71.0 ± 7.0 | 95.7 ± 5.7 | 100.0 ± 0.0 | 23.6 ± 22.1 | 20.0 ± 0.0 | 5.6 ± 1.5 |
🥉✔️o3-mini-2025-01-31 (default) | 4.6 | 33.3 ± 15.3 | 34.9 ± 14.6 | 91.7 ± 1.5 | 66.3 ± 3.6 | 0.0 ± 0.0 | 89.0 ± 2.1 | 75.0 ± 0.0 | 64.7 ± 18.8 | 25.0 ± 50.0 | 15.0 ± 0.0 | 73.3 ± 46.2 | 25.3 ± 7.3 |
🥇✔️Gemini-2.5-pro-preview-03-25 (default) | 3.9 | 13.3 ± 11.5 | 38.0 ± 14.4 | 55.7 ± 3.4 | 67.3 ± 3.3 | 83.3 ± 0.0 | 93.7 ± 1.6 | 75.0 ± 0.0 | 69.6 ± 11.9 | 100.0 ± 0.0 | 51.9 ± 31.9 | 73.3 ± 46.2 | 5.1 ± 2.5 |
✔️Claude-3.7-sonnet (default) | 6.3 | 16.7 ± 11.5 | 31.7 ± 8.2 | 81.9 ± 1.6 | 62.6 ± 2.6 | 63.9 ± 19.2 | 89.9 ± 2.5 | 75.0 ± 0.0 | 63.0 ± 24.6 | 50.0 ± 57.7 | 15.0 ± 0.0 | 46.7 ± 46.2 | 5.3 ± 2.7 |
✔️Deepseek-R1-2025-01-20 (default) | 6.4 | 20.0 ± 0.0 | 28.7 ± 13.2 | 83.3 ± 1.5 | 66.9 ± 3.9 | 75.0 ± 0.0 | 91.7 ± 1.1 | 41.7 ± 0.0 | 77.7 ± 13.6 | 50.0 ± 57.7 | 24.9 ± 17.1 | 20.0 ± 0.0 | 11.5 ± 3.4 |
✔️Llama-3.2-3B-it (zeroshot) | 15.4 | 13.3 ± 5.8 | 21.2 ± 8.2 | 5.7 ± 3.2 | 4.2 ± 1.1 | 0.0 ± 0.0 | 47.5 ± 39.2 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 20.0 ± 0.0 | 0.3 ± 0.2 |
✔️Llama-3.2-3B-it (reflection) | 13.6 | 30.0 ± 17.3 | 32.4 ± 8.6 | 4.6 ± 1.3 | 4.4 ± 1.3 | 0.0 ± 0.0 | 47.3 ± 39.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 20.0 ± 0.0 | 0.0 ± 0.0 |
✔️Llama-3.2-3B-it (planning) | 14.5 | 20.0 ± 0.0 | 27.0 ± 8.4 | 4.6 ± 1.3 | 5.1 ± 1.0 | 0.0 ± 0.0 | 56.3 ± 23.6 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 20.0 ± 0.0 | 0.1 ± 0.1 |
✔️Llama-3.2-3B-it (ref-plan) | 14.1 | 16.7 ± 20.8 | 31.8 ± 10.1 | 3.8 ± 0.0 | 5.4 ± 0.4 | 0.0 ± 0.0 | 57.0 ± 31.6 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 20.0 ± 0.0 | 0.1 ± 0.2 |
✔️GPT-4o-2024-08-06 (zeroshot) | 7.5 | 29.7 ± 14.3 | 29.6 ± 9.2 | 49.9 ± 1.3 | 64.0 ± 5.1 | 33.3 ± 0.0 | 93.4 ± 1.5 | 0.0 ± 0.0 | 40.5 ± 35.2 | 50.0 ± 57.7 | 24.7 ± 9.2 | 20.0 ± 0.0 | 5.6 ± 1.5 |
✔️GPT-4o-2024-08-06 (reflection) | 7.5 | 23.3 ± 20.8 | 32.3 ± 15.5 | 85.3 ± 1.5 | 61.3 ± 0.4 | 36.1 ± 4.8 | 85.2 ± 10.3 | 50.0 ± 0.0 | 18.3 ± 5.2 | 0.0 ± 0.0 | 41.3 ± 19.6 | 20.0 ± 0.0 | 3.5 ± 2.9 |
✔️GPT-4o-2024-08-06 (planning) | 7.3 | 30.0 ± 26.5 | 29.4 ± 12.5 | 52.7 ± 0.5 | 59.2 ± 5.7 | 33.3 ± 0.0 | 82.0 ± 8.6 | 13.0 ± 0.0 | 64.6 ± 23.7 | 50.0 ± 57.7 | 35.3 ± 9.2 | 20.0 ± 0.0 | 6.0 ± 5.5 |
✔️GPT-4o-2024-08-06 (ref-plan) | 4.8 | 23.3 ± 20.8 | 34.1 ± 14.2 | 52.8 ± 0.5 | 61.8 ± 4.9 | 38.9 ± 9.6 | 91.6 ± 2.5 | 50.0 ± 0.0 | 95.7 ± 5.7 | 100.0 ± 0.0 | 36.0 ± 25.5 | 20.0 ± 0.0 | 7.0 ± 5.7 |
Agent | Avg.Rank | Elo in StreetFighterIII | Elo in StarCraftII |
---|---|---|---|
✔️Llama-3.2-3B-it (zero-shot) | 7.5 | 1358.4 | 1424.9 |
✔️Qwen-2.5-7B-it (zero-shot) | 6.5 | 1415.4 | 1445.2 |
✔️Minitron-8B-it (zero-shot) | 1.5 | 1680.1 | 1525.0 |
✔️GPT-4o-mini-2025-04-16 (zero-shot) | 4.5 | 1453.0 | 1500.0 |
✔️GPT-4o-2024-08-06 (zero-shot) | 3 | 1602.9 | 1525.0 |
✔️o3-mini-2025-01-31 (zero-shot) | 2 | 1602.9 | 1500.0 |
✔️Claude-3.7-sonnet (zero-shot) | 3.5 | 1434.3 | 1550.1 |
- The leaderboard was last updated on 2025.06.09.
- The Avg.Rank metric is the agent's rank averaged over all 12 games in Orak.
- ✔️ (Checked) indicates that we, the Orak team at KRAFTON AI, have acknowledged the result by accepting the code submission and reproducing the game score.
- If you want to submit your model and agentic strategy to the leaderboard, please check the submission guideline document.
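As a rough illustration of the Avg.Rank note above, the sketch below ranks agents within each game by mean score and then averages those ranks. It is a minimal sketch under assumptions, not the Orak team's scoring code: the function name `average_ranks`, the choice to rank on mean scores only, and the tie rule (tied agents share the mean of the places they occupy) are all hypothetical.

```python
from statistics import mean

def average_ranks(scores):
    """Map each agent to its mean per-game rank (1 = best, higher score wins)."""
    agents = list(scores)
    n_games = len(scores[agents[0]])
    per_game = {agent: [] for agent in agents}
    for g in range(n_games):
        for agent in agents:
            mine = scores[agent][g]
            better = sum(1 for other in agents if scores[other][g] > mine)
            tied = sum(1 for other in agents if scores[other][g] == mine)
            # Assumed tie rule: tied agents share the mean of the places they occupy.
            per_game[agent].append(better + (tied + 1) / 2)
    return {agent: mean(ranks) for agent, ranks in per_game.items()}

# Mean scores copied from the leaderboard (StreetFighterIII, Supermario, AceAttorney).
subset = {
    "GPT-4o (default)": [29.7, 34.1, 85.3],
    "o3-mini (default)": [33.3, 34.9, 91.7],
    "Gemini-2.5-pro (default)": [13.3, 38.0, 55.7],
}
for agent, rank in sorted(average_ranks(subset).items(), key=lambda kv: kv[1]):
    print(f"{agent}: {rank:.2f}")
```

On this three-agent, three-game subset, o3-mini comes out first with an average rank of about 1.33. These values will not match the Avg.Rank column, which covers all listed agents and all 12 games.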
Overview



Orak is a benchmark for evaluating the agentic capabilities of LLMs on diverse video games. For more details, see our paper and GitHub repo!
Citation
@article{park2025orak,
title = {Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games},
author = {Park, Dongmin and Kim, Minkyu and Choi, Beongjun and Kim, Junhyuck and Lee, Keon and Lee, Jonghyun and Park, Inkyu and Lee, Byeong-Uk and Hwang, Jaeyoung and Ahn, Jaewoo and Mahabaleshwarkar, Ameya S. and Kartal, Bilal and Biswas, Pritam and Suhara, Yoshi and Lee, Kangwook and Cho, Jaewoong},
year = {2025},
eprint = {2506.03610},
archivePrefix = {arXiv},
note = {arXiv:2506.03610}
}
Correspondence to: dongmin.park@krafton.com. Thanks to BARLOG for the website template.