SHERLOCK

Unlocking Mysteries in Her Story using GPT-4

KRAFTON AI

We present Sherlock, an innovative approach that leverages the capabilities of large language models (LLMs) to successfully navigate the intricate interactive video game Her Story.


Requiring strong logical deduction skills and long-term planning for each subsequent action, Her Story poses unique challenges for an AI. The game's ambiguous final goal contributes to its complexity; we define one version of "solving" the game as triggering the chit-chat icon, which grants access to the end sequence. Sherlock successfully navigates these challenges and solves Her Story within 204 actions, where each action corresponds to either a 'search keyword' command or a 'play video' command.
To our knowledge, this represents the first attempt to utilize such a complex mystery game as a testing ground for game-playing AI. Notably, our techniques diverge from the typical use of reinforcement learning (RL), which often requires significant resources and intricate reward design to function optimally. Instead, we employ a frozen LLM (namely GPT-4), applying it from beginning to end without any additional training. The distinct advantage of our no-RL, no-gradient approach lies in its simplicity and efficiency, particularly for Her Story, where designing and managing rewards is challenging due to the game's ambiguous objective. This work thus provides an insightful study of how LLMs can be utilized effectively to solve complex tasks without any training, relying simply on modularized LLM components, and opens new avenues for smarter game AI.

Game Description

Her Story is an interactive video game based on a non-linear narrative structure. The mystery game revolves around a series of police interview clips featuring a woman named Hannah talking about her missing husband, Simon. Players must sift through the database of video clips, using exact keyword searches to unlock the mystery.

Graphic Interface of Her Story
Within the game, the player can search for any word, and the database returns all clips in which the speaker uses that word.
If more than five entries are found, the player can only watch the first five.
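For concreteness, the core search mechanic can be mocked in a few lines. This is a hypothetical sketch rather than game code; the Clip type and search function are our own illustrative names, and only the exact-keyword match and the five-result cap follow the rules described above.

```python
import re
from dataclasses import dataclass

@dataclass
class Clip:
    index: int        # database position; chronological order is hidden from the player
    transcript: str   # what the interviewee says in the clip

def search(database: list[Clip], keyword: str, limit: int = 5) -> list[Clip]:
    """Return clips in which the speaker uses the exact keyword; only the
    first `limit` matches are shown, mirroring the in-game five-clip cap."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    matches = [clip for clip in database if pattern.search(clip.transcript)]
    return matches[:limit]

# Toy usage: searching 'name' surfaces only clips that mention that word.
db = [Clip(0, "My name is Hannah Smith."), Clip(1, "Simon never came home.")]
print([c.index for c in search(db, "name")])  # -> [0]
```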
Defining Goals

Her Story is a unique video game whose primary objective is vague compared to most conventional games. Instead of a clear end goal, the game mechanics are such that it concludes when a 'chitchat' button becomes visible and is clicked by the player. This button appears after the player views a certain set of videos, although the specifics of which videos trigger this are unknown. Upon clicking, the chitchat prompt asks whether the player understands the events that have unfolded; if the player responds affirmatively, the ending credits roll.
This mechanism, however, allows for the game to be concluded even without the player truly deciphering the unfolding narrative. It thereby subtly encourages the player to only consider the game complete when they are personally satisfied with their understanding of the events. Consequently, the hidden objective of Her Story can be interpreted as comprehending the unfolding narrative, as subjective an endeavor as that might be.
Therefore, an effective and capable agent should be able to achieve the following end conditions:

  1. The agent can efficiently navigate through the game so that the chitchat button appears
  2. The agent can understand what happened based on its observations of the game

While the first condition is clearly defined, the second inherently bears subjectivity. In this project, we show that Sherlock successfully achieves the first end condition after 204 actions. To gauge the extent to which the second end condition is met, we ask Sherlock to provide a comprehensive analysis after the first condition is achieved. We found that Sherlock uncovered several crucial secrets, though it did not completely decipher some important specifics. We believe that our current model, when run for a longer duration, can entirely unravel the details; we will continually update this page with new results.

Solving Her Story with Sherlock (spoiler alert!)

The Rise of Sherlock

We visualize the progress made by Sherlock as if she were an AI YouTube streamer (displayed at the bottom right of the video). She guides you through the process of solving a mystery game in a friendly manner. Sherlock’s appearance is generated using Midjourney and Stable Diffusion, and she speaks in her own voice through a Text-to-Speech system with a British accent. For this voice synthesis, we utilized the VITS TTS model, carefully trained in-house and known for producing high-fidelity, expressive speech that bears a striking resemblance to the human voice. We also create realistic talking-head videos using SadTalker, which employs a 3D-aware face renderer to generate humanlike movement.

Facing a murder

With the word MURDER, Sherlock realizes the game is about a murder investigation and immediately comes up with the highly related words 'victim' and 'suspect'. The first video is found by searching the keyword 'suspect'; after Sherlock watches a video, the database is checked again.

Catching Lies

Sometimes, characters lie. But Sherlock recognizes when they're lying and doubts their words.
[1] Among the videos related to 'lie', the woman's last words ("My name? That was the only question I failed.") carry a hidden implication about the lie detector result; Sherlock catches it and extracts the next search keyword, 'name'.
[2] Subsequently, within the videos related to 'name', the woman says her name is Hannah during a lie detector test and then immediately apologizes ("Yes. My name is Hannah Smith. Oh, shit. Sorry."), and Sherlock realizes that she is hiding her identity. From this set of circumstances, Sherlock infers: "This could imply that she might be hiding something or feels guilty about revealing her true identity. This information raises more questions about her involvement in the case and whether she is being truthful about other aspects of her story."

Getting closer to the truth.

[1] Sherlock's keyword inference is divided into two cases. In the absence of significant footage, Sherlock searches for generalized keywords related to crime or human psychology. After watching meaningful footage, however, Sherlock identifies keywords that are specific to the case - for example, 'diary', 'real mother', and 'rules'.
[2] Meanwhile, Sherlock's gradual reasoning about the two women with identical appearances is impressive. After noticing that the woman in the white blouse repeats her name several times, Sherlock vaguely mentions her name as Hannah/Eve. Sherlock then deduces that they might be twins, and thinks that this clue could change the perspective of the case. (Sherlock says, "the woman in a long-sleeved white blouse with her hair down reveals that her mother called her Eve. This information suggests that she might be Hannah's twin sister, as they were born at the same time and had separate names. This revelation could potentially change our understanding of the relationships between the individuals involved in the case.") Eventually, after viewing all of the footage related to 'Hannah', 'Eve', and 'twin', Sherlock identifies the woman in the white blouse as Eve, which can be found in the final history summary.

Gathering important details

Sherlock searches for keywords and gathers information about the people involved in the murder, piece by piece, until she finally has enough information (and the evidence and truth behind it) to deduce the murder. This information is stored in a final summary of the record, categorized by Useful keywords, Key individuals, and a Timeline. These are important hints for Sherlock’s final deduction.

Sherlock's Reasoning

Sherlock makes sense of all 204 fragmented videos and explains the complex truths that have been uncovered.


In the context of ongoing research and evaluation, our current work on Sherlock presents an array of future prospects. As of now, we have not publicly released the full gameplay recordings of Sherlock due to potential copyright issues, instead choosing to share select highlights. Our methodology, however, remains reproducible using the prompts we have outlined for GPT-4.
If anyone needs the full log for research purposes, please email yunseon@krafton.com.

So How Does It Work?

Creating an AI agent capable of solving Her Story involves two primary components.

  • The first component is responsible for translating game environment observations into text and converting text outputs into in-game actions (navy modules in the diagram below).
  • The second component, what we can refer to as the 'brain', is an LLM-powered agent, Sherlock, which determines the subsequent course of action (red modules in the diagram below).
Model pipeline employing GPT-4
Overview of how the Her Story game environment and Sherlock interact.
"Textifying" Her Story for Interaction

The first component functions by textifying the game. Essentially, we transform Her Story, a video game, into a text-based game. For 'search' actions, the game provides a list of video clips along with information about which videos have already been viewed. We transcribe this information into text, employing consistent language, and include an appended list labeled 'Unwatched video index', as demonstrated in the accompanying figure below. 'Play video' actions, where the game plays the chosen clip to the player, are also textified. Any visual or auditory information is translated into text, with speech transcriptions and descriptions of the speaker's visual characteristics. Lastly, any commands that Sherlock produces, be it 'search' or 'play', are detected and executed within the actual game.

Input and output examples
Inputs and outputs of the Prospective module. There are two possible commands, search and play,
which are translated into their actual corresponding actions in the game.
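As a rough illustration of this translation layer (the actual implementation is not released, so every function and field name below is our own assumption), textification and command parsing might look like the following sketch, which presumes that transcripts and visual descriptions of each clip have been prepared ahead of time.

```python
import re

def textify_search_result(keyword: str, clips: list[dict], watched: set[int]) -> str:
    """Render a search result for the LLM, including an 'Unwatched video index' list."""
    lines = [f"Search result for '{keyword}': {len(clips)} clip(s) shown."]
    for clip in clips:
        status = "watched" if clip["index"] in watched else "unwatched"
        lines.append(f"- Clip {clip['index']} ({status})")
    unwatched = [c["index"] for c in clips if c["index"] not in watched]
    lines.append(f"Unwatched video index: {unwatched}")
    return "\n".join(lines)

def textify_clip(clip: dict) -> str:
    """Describe a played clip: speech transcription plus the speaker's visual traits."""
    return (f"Clip {clip['index']} | timestamp: {clip['time']} | "
            f"speaker appearance: {clip['appearance']}\n"
            f"Transcript: {clip['transcript']}")

def parse_command(llm_output: str):
    """Detect a 'search <keyword>' or 'play <index>' command in Sherlock's output."""
    if m := re.search(r"search\s+'?(\w+)'?", llm_output, re.IGNORECASE):
        return ("search", m.group(1).lower())
    if m := re.search(r"play\s+(?:video\s+)?(\d+)", llm_output, re.IGNORECASE):
        return ("play", int(m.group(1)))
    return None
```

Whatever command parse_command detects is then executed in the actual game, and the resulting observation is textified and handed back to Sherlock.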
Prospective Module

Sherlock, the second component in our work, comprises two modules: the Prospective module and the Retrospective module. The Prospective module takes text-based observations as input and generates a command as output. This process follows a structured reasoning framework, demonstrated in GPT-4's system prompt, consisting of three steps: (1) abductive reasoning, (2) search keyword planning, and (3) decision making for the next command. Read the system prompt below.

Prospective module and prompt
System prompt for Sherlock. User always corresponds to the game environment observation, while Assistant represents Sherlock reasoning and navigating the game. [__KEYWORDS__] is replaced with the search history, and [__SUMMARY__] is replaced with the Retrospective module's newest running summary.
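To make the loop concrete, below is a minimal sketch of a single Prospective-module turn using the OpenAI Python client. The prompt text and helper names are illustrative stand-ins rather than the actual prompt shown above; only the [__KEYWORDS__]/[__SUMMARY__] substitution and the observation-in, command-out flow follow the description.

```python
from openai import OpenAI  # assumes the `openai` package and an OPENAI_API_KEY are set up

client = OpenAI()

# Illustrative stand-in for the real system prompt shown above.
SYSTEM_TEMPLATE = """You are Sherlock, an agent solving the mystery game Her Story.
For each observation: (1) reason abductively about what it implies, (2) plan promising
search keywords, and (3) output exactly one command: search <keyword> or play <index>.
Search history so far: [__KEYWORDS__]
Running summary of findings: [__SUMMARY__]"""

def prospective_step(observation: str, keywords: list[str], summary: str,
                     dialogue: list[dict]) -> str:
    """One Prospective-module turn: textified observation in, reasoning plus command out."""
    system = (SYSTEM_TEMPLATE
              .replace("[__KEYWORDS__]", ", ".join(keywords) or "none")
              .replace("[__SUMMARY__]", summary or "nothing yet"))
    messages = ([{"role": "system", "content": system}]
                + dialogue  # recent user (observation) / assistant (Sherlock) turns
                + [{"role": "user", "content": observation}])
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    return response.choices[0].message.content
```

The returned text is parsed for a search or play command and executed in the game, as described in the textification section.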
Retrospective Module

However, utilizing only the Prospective module is insufficient, as the agent tends to repeat meaningless searches and struggles with short-term memory limitations. We thus introduce the Retrospective module. This module reflects on the agent's recent thoughts and actions, creating a running summary of important findings. Employed every 6 turns, it provides Sherlock with a long-term game-playing memory. As a result, the newest memory replaces the older one in the Prospective module's system prompt, ensuring that Sherlock's decisions are continually informed by the most up-to-date and relevant information.

Retrospective module and prompt
Prompt (user) for Retrospective module. Given recent dialogue history and old summary in the system prompt,
the user prompt queries Sherlock to highlight any significant discoveries.
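A corresponding sketch of the Retrospective module is shown below. The prompt wording is our assumption; the every-six-turns cadence, the use of the previous summary, and the request to highlight significant discoveries follow the description above.

```python
SUMMARY_INTERVAL = 6  # the Retrospective module is invoked every 6 turns

def retrospective_update(client, old_summary: str, recent_dialogue: list[dict]) -> str:
    """Compress Sherlock's recent thoughts and actions into an updated running summary."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in recent_dialogue)
    messages = [
        {"role": "system",
         "content": f"Previous summary of the investigation:\n{old_summary or 'none yet'}"},
        {"role": "user",
         "content": ("Recent dialogue:\n" + transcript +
                     "\nHighlight any significant discoveries and return an updated summary "
                     "(useful keywords, key individuals, timeline).")},
    ]
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    return response.choices[0].message.content

# In the main loop (sketch): every SUMMARY_INTERVAL turns,
#   summary = retrospective_update(client, summary, dialogue[-2 * SUMMARY_INTERVAL:])
# and the new summary replaces [__SUMMARY__] in the Prospective module's system prompt.
```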

Why Simpler Approaches Fail

The implementation of long-term memory in our system, as seen in the system prompt, takes on two distinct forms. The first is a relatively simple form - the search history, denoted as [__KEYWORDS__]. This feature is designed to mirror the in-game function that allows the player to scroll through their search history. The second, more complex form is the running summary, which compiles the most relevant information to solve the mystery every few turns.
Without the search history, the agent is prone to repeated use of the same keywords. Despite explicit instructions to avoid such repetition, the agent often retraced a previous sequence of keywords. However, relying solely on the search history is insufficient. The running memory serves not just as a repository for important factual information, but also as a record of the rationale behind the agent's past search decisions. In its absence, we noticed that the agent often ends up reiterating the same line of inquiry with different keywords.
We conducted experiments using GPT-4 with an 8k context length. Since the total length of all monologues is roughly 18k tokens, it may be possible to solve the game without an explicit memory system using GPT-4 with a 32k context length. However, this approach would likely be inefficient in terms of both inference and cost. Moreover, given the importance of a coherent reasoning trace, including the agent's chain-of-thought could potentially exceed a 32k token length. To facilitate efficient exploration and comprehensive reasoning, some form of context restructuring is necessary, which would involve the removal of superfluous information. Therefore, our approach, demonstrated in the Prospective and Retrospective modules, provides a more balanced and effective solution.

LLM-powered Game Agents

The rising capabilities of LLMs have fueled a surge of interest in their application within game-playing AI agents. While there have been numerous attempts to incorporate LLMs into RL algorithms, only recently have there been efforts to explore no-RL, no-gradient approaches that utilize LLMs without any additional training.
Two concurrent works have focused on the game of Minecraft. SPRING (Wu et al., 2023) demonstrated that a strategically designed agent, powered by LLMs, could outperform existing RL-based solutions on the Crafter benchmark, which is inspired by Minecraft. Likewise, Voyager (Wang et al., 2023), with its intricate design of LLM modules for planning and action execution, showed a dramatic improvement in efficiency for tech tree mastery and generalization to unseen tasks.

What skills are required for Her Story vs. Minecraft?

Her Story and Minecraft represent two distinct facets of the gaming world, each requiring a unique set of skills and approaches. Her Story is a non-linear, narrative-driven game where the primary objective remains vague, and success is often subjective, hinging upon the player's understanding of the complex unfolding mystery. On the other hand, Minecraft offers an open-ended environment, where the player can interact with a mutable world, construct complex structures, and devise survival strategies.
As such, Her Story demands a high degree of deductive reasoning, inferential thinking, and an ability to synthesize information from incoherent sources, whereas Minecraft requires spatial understanding, resource management, and strategic planning. In both cases, traditional RL approaches may struggle due to the lack of distinct reward functions and necessarily long-horizon tasks. Moreover, Her Story's ambiguity and narrative complexity present a different set of challenges compared to Minecraft's open-ended environment. So while it might be tempting to view one game as inherently more challenging than the other, such a comparison may be reductive, given the distinct skill sets each one demands.

Limitations

1. Sherlock was unable to utilize all the visual elements in the video. As it is difficult to convert all the information into text, only the time, the length of the characters’ hair, and their clothing were provided to Sherlock. This limited environment makes it challenging for Sherlock to deduce details. For instance, in this game, it is crucial to differentiate between the twins, Hannah and Eve, who have identical faces, and one decisive factor for distinguishing them is a tattoo on their arms. While an average gamer could infer the identities of Hannah and Eve based on the presence or absence of this tattoo, it is impossible for Sherlock. In the future, it will be necessary to convert more visual information into text to overcome such limitations.
2. The level of detail in Sherlock’s deductions is somewhat lacking. During the process of inferring and summarizing from the video, Sherlock often fails to grasp important information. This may be due to the inability to utilize visual elements, as mentioned earlier.
3. Sherlock’s deductions sometimes deviate from the intended solution in the game. For example, the intended answer in the game is “Hannah kills Simon, and Eve assists her as an accomplice,” but Sherlock speculates, after watching the videos, that “Eve killed Simon out of jealousy for Hannah’s relationship with him.” This seems to be the result of Sherlock making subjective deductions based on fragmented information. However, by running the current model further and accumulating more information, we expect it to reach an accurate deduction.

Future Direction

To properly evaluate Sherlock, there are a few caveats and future work we must consider.

On Textification

It is inevitable that transferring information from one modality (visual) to another (text) results in information loss. The process of "textifying" Her Story introduces an intriguing conundrum pertaining to the characterization of the woman appearing in the video clips. In the original game, players encounter the same woman across different clips, leading them, quite reasonably, to initially assume that all footage features the same individual. A critical juncture in the game is the potential realization that there might be two women featured in the interviews - an interpretation that remains inherently ambiguous.
When translating the visual game into a text-based version, our descriptive approach is somewhat limited in its ability to convey this subtle narrative design. The visual descriptions of the woman in each video merely identify her as "a woman", without further distinguishing characteristics. Considering that there are seven distinct sessions of videos, with each session featuring the woman in a different outfit, it is plausible for the AI agent to interpret these as seven separate individuals.

Textification example
An example of textification.

The implications of this on the game's difficulty remain unclear. On one hand, it might simplify the game since Sherlock does not initially fall into the trap of assuming all videos feature the same person. On the other hand, it could potentially increase the challenge as Sherlock is now tasked with determining which among the seven individuals are the same person. Furthermore, there are instances where the woman in the session does not explicitly identify herself. We mention this caveat as there may be a difference in gameplay between Sherlock and a real human player.

On Contamination

A potential limitation of this work relates to the issue of test set contamination and information leakage, given that the game Her Story was released in 2015, six years prior to the knowledge cutoff of GPT-4 in 2021. It is conceivable that GPT-4 might have been exposed to some level of information about Her Story during its training phase, such as narrative details, plot interpretations, or gaming strategies culled from various online sources.
However, we emphasize that a familiarity with Her Story, even a thorough understanding of its narrative intricacies, does not equate to the ability to play the game effectively. The gameplay demands not only the interpretation of the narrative but also the strategic implementation of responsive actions within the game environment. The decision-making processes required for successful navigation of the game go beyond mere knowledge of the plot and require the application of reasoning skills.
Furthermore, we have observed that GPT-4 was not able to autocomplete transcripts from the game, as done in Chang et al. (2023), suggesting that it did not memorize the game's content, even if it might have been exposed to broad thematic or narrative elements of Her Story during its training, for instance through sources such as Wikipedia. This lends credibility to the assertion that the game-solving abilities demonstrated by the AI are driven by its inferential and decision-making capabilities, rather than any pre-existing knowledge of the game. Nonetheless, the extent of influence of pre-training exposure to the game's content should be investigated.

On Evaluation

Future work involves the development of more robust evaluation metrics and benchmarks for assessing various approaches to playing Her Story. Given the game's inherent open-endedness, it might be appropriate to consider customizable end goals.

  1. One proposed evaluation strategy involves quantifying the number of search and play actions the agent needs to perform to access a particular proportion of the video content (for instance, 90%). This metric could offer a measure of how efficiently the agent navigates the game environment in relation to the sparse clues dispersed throughout the gameplay (aligned with end condition #1). A sketch of this metric follows the list.
  2. Another metric could involve generating a set of questions that yield clear, unambiguous answers about the underlying plot of Her Story. Evaluating the agent's ability to answer these questions accurately would provide insight into the agent's holistic understanding of the narrative (aligned with end condition #2).
  3. It would be valuable to compare the performance of Sherlock with that of the "average" human player. While play time for humans reportedly ranges between 2 and 7 hours, quantifying Sherlock's performance relative to the human average could offer a meaningful benchmark.
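As a minimal sketch of the first proposed metric, assuming we log which clips each action makes newly accessible (the logging format here is hypothetical):

```python
def actions_to_coverage(action_log: list[set[int]], total_clips: int,
                        target: float = 0.9) -> int | None:
    """Number of actions needed before `target` of all clips has been accessed.

    `action_log[i]` is the set of clip indices made accessible by action i
    (empty if the action reveals nothing new). Returns None if the target
    coverage is never reached within the log.
    """
    seen: set[int] = set()
    for n_actions, clips in enumerate(action_log, start=1):
        seen |= clips
        if len(seen) / total_clips >= target:
            return n_actions
    return None

# Example: in a 10-clip game, 50% coverage is reached after 3 actions.
log = [{0, 1}, set(), {2, 3, 4}, {5}]
print(actions_to_coverage(log, total_clips=10, target=0.5))  # -> 3
```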

Most crucially, a more accessible and efficient testbed for conducting experiments is needed, allowing more trials for statistical accuracy. These evaluation methods would pave the way for a more nuanced understanding of AI capabilities in complex, narrative-driven gaming environments.

Conclusion

One of the primary advantages of using LLMs in game AI development is the ability to bypass some of the challenges commonly associated with RL. These include intricate reward design, a high training cost, and the difficulty of providing sufficiently diverse training environments. LLMs offer a rich understanding of the world, an asset that would be underutilized if not harnessed.
Our work extends this nascent line of research, demonstrating the potential of LLMs in challenging games like Her Story, where RL approaches struggle due to the game's artfully vague objective and narrative-driven design. Thus, we contribute to the growing body of evidence that underscores the versatility and value of large language models in the evolving landscape of game-playing AI.

Acknowledgements

This website is in part based on a template of Michaël Gharbi, also used in PixelNeRF, PlenOctrees, and Plenoxels.