Orak

A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games

Dongmin Park*, Minkyu Kim*, Beongjun Choi*, Junhyuck Kim, Keon Lee, Jonghyun Lee,
Inkyu Park, Byeong-Uk Lee, Jaeyoung Hwang, Jaewoo Ahn, Ameya S. Mahabaleshwarkar,
Bilal Kartal, Pritam Biswas, Yoshi Suhara, Kangwook Lee, and Jaewoong Cho

Performance

Model (Agentic Strategy)	Avg.Rank	StreetFighterIII	SuperMario	AceAttorney	HerStory	Pokémon	DarkestDungeon	Minecraft	StardewValley	StarCraftII	SlayTheSpire	BabaIsYou	2048
✔️Llama-3.2-1B-it (default)	17.8	0.0 ± 0.0	18.7 ± 8.6	1.3 ± 2.2	2.1 ± 1.2	0.0 ± 0.0	0.0 ± 0.0	0.0 ± 0.0	0.0 ± 0.0	0.0 ± 0.0	0.0 ± 0.0	6.7 ± 11.5	0.0 ± 0.1
✔️Llama-3.2-3B-it (default)	14.6	13.3 ± 5.8	31.8 ± 10.1	4.6 ± 1.3	4.2 ± 1.1	0.0 ± 0.0	47.5 ± 39.2	0.0 ± 0.0	0.0 ± 0.0	0.0 ± 0.0	0.0 ± 0.0	20.0 ± 0.0	0.3 ± 0.2
✔️Qwen-2.5-3B-it (default)	15.8	20.0 ± 0.0	23.4 ± 14.1	20.0 ± 17.4	1.2 ± 1.1	0.0 ± 0.0	44.8 ± 22.2	0.0 ± 0.0	0.0 ± 0.0	0.0 ± 0.0	0.0 ± 0.0	13.3 ± 11.5	0.1 ± 0.1
✔️Qwen-2.5-7B-it (default)	12.9	16.7 ± 11.5	27.2 ± 9.6	9.3 ± 0.2	8.5 ± 1.9	0.0 ± 0.0	88.8 ± 2.0	0.0 ± 0.0	0.0 ± 0.0	0.0 ± 0.0	5.0 ± 0.0	20.0 ± 0.0	0.6 ± 0.4
✔️Minitron-4B-it (default)	14.9	16.7 ± 11.5	24.4 ± 6.0	35.7 ± 4.5	4.5 ± 2.2	0.0 ± 0.0	0.0 ± 0.0	0.0 ± 0.0	0.0 ± 0.0	0.0 ± 0.0	0.0 ± 0.0	20.0 ± 0.0	0.1 ± 0.0
✔️Minitron-8B-it (default)	12.5	23.3 ± 5.8	31.3 ± 12.8	29.9 ± 3.6	8.2 ± 1.8	0.0 ± 0.0	63.8 ± 30.4	0.0 ± 0.0	0.0 ± 0.0	0.0 ± 0.0	0.0 ± 0.0	20.0 ± 0.0	0.7 ± 0.7
✔️GPT-4o-mini-2025-04-16 (default)	11.2	16.7 ± 11.5	28.8 ± 8.8	28.4 ± 2.8	21.1 ± 5.5	0.0 ± 0.0	81.3 ± 5.8	46.0 ± 7.0	18.9 ± 32.7	75.0 ± 50.0	3.3 ± 2.9	13.3 ± 11.5	1.1 ± 1.0
🥈✔️GPT-4o-2024-08-06 (default)	4.3	29.7 ± 14.3	34.1 ± 14.2	85.3 ± 1.5	64.0 ± 5.1	38.9 ± 9.6	93.4 ± 1.5	71.0 ± 7.0	95.7 ± 5.7	100.0 ± 0.0	23.6 ± 22.1	20.0 ± 0.0	5.6 ± 1.5
🥉✔️o3-mini-2025-01-31 (default)	4.6	33.3 ± 15.3	34.9 ± 14.6	91.7 ± 1.5	66.3 ± 3.6	0.0 ± 0.0	89.0 ± 2.1	75.0 ± 0.0	64.7 ± 18.8	25.0 ± 50.0	15.0 ± 0.0	73.3 ± 46.2	25.3 ± 7.3
🥇✔️Gemini-2.5-pro-preview-03-25 (default)	3.9	13.3 ± 11.5	38.0 ± 14.4	55.7 ± 3.4	67.3 ± 3.3	83.3 ± 0.0	93.7 ± 1.6	75.0 ± 0.0	69.6 ± 11.9	100.0 ± 0.0	51.9 ± 31.9	73.3 ± 46.2	5.1 ± 2.5
✔️Claude-3.7-sonnet (default)	6.3	16.7 ± 11.5	31.7 ± 8.2	81.9 ± 1.6	62.6 ± 2.6	63.9 ± 19.2	89.9 ± 2.5	75.0 ± 0.0	63.0 ± 24.6	50.0 ± 57.7	15.0 ± 0.0	46.7 ± 46.2	5.3 ± 2.7
✔️Deepseek-R1-2025-01-20 (default)	6.4	20.0 ± 0.0	28.7 ± 13.2	83.3 ± 1.5	66.9 ± 3.9	75.0 ± 0.0	91.7 ± 1.1	41.7 ± 0.0	77.7 ± 13.6	50.0 ± 57.7	24.9 ± 17.1	20.0 ± 0.0	11.5 ± 3.4
✔️Llama-3.2-3B-it (zeroshot)	15.4	13.3 ± 5.8	21.2 ± 8.2	5.7 ± 3.2	4.2 ± 1.1	0.0 ± 0.0	47.5 ± 39.2	0.0 ± 0.0	0.0 ± 0.0	0.0 ± 0.0	0.0 ± 0.0	20.0 ± 0.0	0.3 ± 0.2
✔️Llama-3.2-3B-it (reflection)	13.6	30.0 ± 17.3	32.4 ± 8.6	4.6 ± 1.3	4.4 ± 1.3	0.0 ± 0.0	47.3 ± 39.0	0.0 ± 0.0	0.0 ± 0.0	0.0 ± 0.0	0.0 ± 0.0	20.0 ± 0.0	0.0 ± 0.0
✔️Llama-3.2-3B-it (planning)	14.5	20.0 ± 0.0	27.0 ± 8.4	4.6 ± 1.3	5.1 ± 1.0	0.0 ± 0.0	56.3 ± 23.6	0.0 ± 0.0	0.0 ± 0.0	0.0 ± 0.0	0.0 ± 0.0	20.0 ± 0.0	0.1 ± 0.1
✔️Llama-3.2-3B-it (ref-plan)	14.1	16.7 ± 20.8	31.8 ± 10.1	3.8 ± 0.0	5.4 ± 0.4	0.0 ± 0.0	57.0 ± 31.6	0.0 ± 0.0	0.0 ± 0.0	0.0 ± 0.0	0.0 ± 0.0	20.0 ± 0.0	0.1 ± 0.2
✔️GPT-4o-2024-08-06 (zeroshot)	7.5	29.7 ± 14.3	29.6 ± 9.2	49.9 ± 1.3	64.0 ± 5.1	33.3 ± 0.0	93.4 ± 1.5	0.0 ± 0.0	40.5 ± 35.2	50.0 ± 57.7	24.7 ± 9.2	20.0 ± 0.0	5.6 ± 1.5
✔️GPT-4o-2024-08-06 (reflection)	7.5	23.3 ± 20.8	32.3 ± 15.5	85.3 ± 1.5	61.3 ± 0.4	36.1 ± 4.8	85.2 ± 10.3	50.0 ± 0.0	18.3 ± 5.2	0.0 ± 0.0	41.3 ± 19.6	20.0 ± 0.0	3.5 ± 2.9
✔️GPT-4o-2024-08-06 (planning)	7.3	30.0 ± 26.5	29.4 ± 12.5	52.7 ± 0.5	59.2 ± 5.7	33.3 ± 0.0	82.0 ± 8.6	13.0 ± 0.0	64.6 ± 23.7	50.0 ± 57.7	35.3 ± 9.2	20.0 ± 0.0	6.0 ± 5.5
✔️GPT-4o-2024-08-06 (ref-plan)	4.8	23.3 ± 20.8	34.1 ± 14.2	52.8 ± 0.5	61.8 ± 4.9	38.9 ± 9.6	91.6 ± 2.5	50.0 ± 0.0	95.7 ± 5.7	100.0 ± 0.0	36.0 ± 25.5	20.0 ± 0.0	7.0 ± 5.7

Agent	Avg.Rank	Elo in StreetFighterIII	Elo in StarCraftII
✔️Llama-3.2-3B-it (zero-shot)	7.5	1358.4	1424.9
✔️Qwen-2.5-7B-it (zero-shot)	6.5	1415.4	1445.2
✔️Minitron-8B-it (zero-shot)	1.5	1680.1	1525.0
✔️GPT-4o-mini-2025-04-16 (zero-shot)	4.5	1453.0	1500.0
✔️GPT-4o-2024-08-06 (zero-shot)	3	1602.9	1525.0
✔️o3-mini-2025-01-31 (zero-shot)	2	1602.9	1500.0
✔️Claude-3.7-sonnet (zero-shot)	3.5	1434.3	1550.1

- The leaderboard was last updated on 2025.06.09.
- The Avg.Rank metric is the agent's rank averaged over all 12 games in Orak.
- ✔️ (Checked) indicates that we, the Orak team at KRAFTON AI, have verified the result by accepting a code submission and reproducing the game score.
- If you want to submit your model and agentic strategy to the leaderboard, please check the submission guideline document.
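
As a rough illustration of how an average-rank metric such as Avg.Rank can be computed, the sketch below ranks agents within each game by score (higher is better) and then averages each agent's ranks across games. The function name `average_ranks`, the tie rule (tied agents share the mean of their positions), and the three-agent, three-game subset are assumptions made for this example; the page does not state the exact procedure, so the output is not expected to match the published Avg.Rank values.

```python
from collections import defaultdict


def average_ranks(scores):
    """Return {agent: mean rank over games} for scores[agent][game].

    Higher scores rank better. Tied agents share the mean of the
    positions they occupy (an assumed tie rule, not taken from Orak).
    """
    games = sorted({game for per_game in scores.values() for game in per_game})
    totals = defaultdict(float)
    for game in games:
        # Best score first.
        ordered = sorted(scores, key=lambda agent: scores[agent][game], reverse=True)
        i = 0
        while i < len(ordered):
            # Extend j to cover every agent tied with the one at position i.
            j = i
            while (j + 1 < len(ordered)
                   and scores[ordered[j + 1]][game] == scores[ordered[i]][game]):
                j += 1
            shared_rank = ((i + 1) + (j + 1)) / 2  # mean of 1-based positions
            for agent in ordered[i:j + 1]:
                totals[agent] += shared_rank
            i = j + 1
    return {agent: totals[agent] / len(games) for agent in scores}


# Mean scores copied from three rows and three columns of the table above.
scores = {
    "GPT-4o-2024-08-06 (default)": {"AceAttorney": 85.3, "HerStory": 64.0, "Pokémon": 38.9},
    "o3-mini-2025-01-31 (default)": {"AceAttorney": 91.7, "HerStory": 66.3, "Pokémon": 0.0},
    "Gemini-2.5-pro-preview-03-25 (default)": {"AceAttorney": 55.7, "HerStory": 67.3, "Pokémon": 83.3},
}

avg = average_ranks(scores)
for agent, rank in sorted(avg.items(), key=lambda item: item[1]):
    print(f"{agent}\t{rank:.2f}")
```

On this subset, an agent that is first in two games and last in one gets a better average rank than an agent that is consistently second, which is why a model can top the leaderboard without leading in every game.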

Overview

Orak is a benchmark for evaluating the agentic capabilities of LLMs on diverse video games. For more details, see our paper and GitHub repo.

Citation

@article{park2025orak,
  title     = {Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games},
  author    = {Park, Dongmin and Kim, Minkyu and Choi, Beongjun and Kim, Junhyuck and Lee, Keon and Lee, Jonghyun and Park, Inkyu and Lee, Byeong-Uk and Hwang, Jaeyoung and Ahn, Jaewoo and Mahabaleshwarkar, Ameya S. and Kartal, Bilal and Biswas, Pritam and Suhara, Yoshi and Lee, Kangwook and Cho, Jaewoong},
  year      = {2025},
  eprint    = {2506.03610},
  archivePrefix = {arXiv},
  note      = {arXiv:2506.03610}
}

Correspondence to: dongmin.park@krafton.com. Thanks to BARLOG for the website template.