Orak

A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games

Dongmin Park*, Minkyu Kim*, Beongjun Choi*, Junhyuck Kim, Keon Lee, Jonghyun Lee,
Inkyu Park, Byeong-Uk Lee, Jaeyoung Hwang, Jaewoo Ahn, Ameya S. Mahabaleshwarkar,
Bilal Kartal, Pritam Biswas, Yoshi Suhara, Kangwook Lee, and Jaewoong Cho

Performance

- The leaderboard was last updated on 2025.06.09.
- The Avg. Rank metric is the agent's average ranking across all 12 games in Orak (see the sketch after this list).
- ✔️ Checked indicates that we, the Orak team at KRAFTON AI, have verified the result by accepting the code submission and reproducing the game score.
- If you want to submit your model and agentic strategy to the leaderboard, please check the submission guideline document.
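As a rough illustration of the Avg. Rank metric, the snippet below sketches how an agent's average ranking across games could be computed. This is not the official evaluation code; the game names and ranks are hypothetical placeholders.

```python
# Illustrative sketch only: average an agent's per-game leaderboard rank.
# Game names and ranks below are hypothetical, not actual leaderboard data.

def average_rank(per_game_ranks: dict[str, int]) -> float:
    """Return the mean of the agent's rankings over the evaluated games."""
    return sum(per_game_ranks.values()) / len(per_game_ranks)

# Example: an agent ranked 1st, 3rd, and 2nd on three hypothetical games.
ranks = {"game_a": 1, "game_b": 3, "game_c": 2}
print(average_rank(ranks))  # 2.0
```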

Overview

ORAK benchmark overview
Claude playing "Ace Attorney"
Claude playing "Baba Is You"

Orak is a benchmark for evaluating the agentic capabilities of LLMs on diverse video games. Check out our paper and GitHub repo for more details!

Citation

@article{park2025orak,
  title     = {Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games},
  author    = {Park, Dongmin and Kim, Minkyu and Choi, Beongjun and Kim, Junhyuck and Lee, Keon and Lee, Jonghyun and Park, Inkyu and Lee, Byeong-Uk and Hwang, Jaeyoung and Ahn, Jaewoo and Mahabaleshwarkar, Ameya S. and Kartal, Bilal and Biswas, Pritam and Suhara, Yoshi and Lee, Kangwook and Cho, Jaewoong},
  year      = {2025},
  eprint    = {2506.03610},
  archivePrefix = {arXiv},
  note      = {arXiv:2506.03610}
}

Correspondence to: dongmin.park@krafton.com. Website template adapted from BARLOG.