When players talk to AI characters in a game like Shouldermen, the experience feels simple: you speak, the character listens, and it responds with a unique voice. Behind that simplicity is a carefully orchestrated stack of AI services working in real time. In this post, we break down exactly how GPT-5, Eleven Labs, and Deepgram come together to let you have genuine, unscripted conversations with an angel and a devil who sit on your shoulders.

If you want the broader story of how Shouldermen came to be, start with our earlier post on building a voice-controlled AI game. Here, we go deeper into the technical machinery.

How GPT-5 Generates Character Dialog in Real Time

Shouldermen is one of a growing number of games that use GPT-5 to produce AI-generated dialog in real time rather than relying on pre-written scripts. Each character — the angel and the devil — runs on its own system prompt that defines personality, vocabulary, moral perspective, and knowledge of the current puzzle state. When the player speaks, their transcribed words are fed into the model along with the full conversation history and game context.
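As a rough sketch of what that prompt assembly looks like — all names and the abbreviated persona text here are hypothetical, not the actual Shouldermen code:

```python
# Hypothetical sketch of per-character prompt assembly: persona prompt,
# game state, conversation history, and the new player utterance are
# combined into one chat-completion message list.

ANGEL_SYSTEM_PROMPT = (
    "You are the angel on the player's shoulder. You are patient and "
    "encouraging, you speak in a warm, measured tone, and you nudge "
    "the player toward the honest solution to the current puzzle."
)

def build_messages(system_prompt, history, puzzle_state, player_utterance):
    """Combine persona, game context, and history for one model call."""
    messages = [{"role": "system", "content": system_prompt}]
    # Game state goes in a separate system message so the persona
    # prompt itself never changes between turns.
    messages.append({"role": "system", "content": f"Puzzle state: {puzzle_state}"})
    messages.extend(history)  # prior user/assistant turns
    messages.append({"role": "user", "content": player_utterance})
    return messages
```

The same function serves both characters: only the system prompt passed in differs, which is what keeps the two personas cleanly separated.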

GPT-5's improvements in instruction following and contextual consistency make it particularly well suited for this kind of work. The angel maintains a patient, encouraging tone across dozens of exchanges, while the devil stays sardonic and subversive, even when both are responding to the same player input. The model handles these distinct personas without blending them, which was a persistent challenge with earlier models. We stream the response token by token so that speech synthesis can begin before the full reply is generated, a technique that shaves critical milliseconds off the perceived response time.
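The token-streaming hand-off can be sketched as a sentence chunker: text fragments arrive from the model, and each sentence is released to the synthesis stage the moment it closes. This is a minimal illustration, not the production implementation:

```python
# Split a stream of text fragments into sentence-sized chunks so that
# speech synthesis can start before the full reply exists. The token
# source is any iterable of text deltas from a streaming chat API.

SENTENCE_ENDINGS = (".", "!", "?")

def sentence_chunks(token_stream):
    """Yield each sentence as soon as it closes, instead of waiting
    for the whole response."""
    buffer = ""
    for token in token_stream:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_ENDINGS):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():  # flush any trailing partial sentence
        yield buffer.strip()
```

With this in place, the first sentence of a long reply can already be playing while the model is still writing the third.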

Distinct Voices with Eleven Labs

AI-generated dialog is only half the equation. For players to truly feel like they are talking to AI characters in a game, those characters need to sound distinct and alive. Eleven Labs provides the text-to-speech layer that gives the angel and devil their own voices.

We designed two custom voice profiles: the angel speaks with a warm, measured cadence, while the devil's voice is sharper, faster, and carries a subtle edge of amusement. Eleven Labs supports streaming synthesis, which means we can begin playing audio as soon as the first chunk of GPT-5's response arrives. This streaming pipeline is essential. Without it, the player would wait for the entire response to be generated and then synthesized — a delay that would break immersion entirely.
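A producer/consumer queue captures the essence of that overlap: synthesized audio chunks are handed to a playback thread so that playing sentence one overlaps synthesizing sentence two. The `synthesize_chunk` and play callables below are hypothetical stand-ins for the Eleven Labs client and the game's audio layer:

```python
# Sketch of the streaming TTS hand-off: a playback thread consumes
# audio chunks while the main thread keeps synthesizing the rest.

import queue
import threading

def playback_worker(audio_queue, play_audio):
    """Consume synthesized audio chunks and play them in order."""
    while True:
        chunk = audio_queue.get()
        if chunk is None:  # sentinel: response finished
            break
        play_audio(chunk)

def speak_streaming(sentences, synthesize_chunk, audio_queue):
    """Synthesize each sentence as it arrives and hand the audio to
    the playback thread, so playback and synthesis overlap."""
    for sentence in sentences:
        audio_queue.put(synthesize_chunk(sentence))
    audio_queue.put(None)
```

The queue is what decouples the two stages: neither side ever waits for the other to finish the whole response.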

Voice quality also matters for sustaining belief in the characters. Eleven Labs produces output that avoids the robotic flatness that plagues many TTS engines. Inflection, pacing, and emphasis shift naturally based on sentence structure, which helps each line of AI-generated dialog feel like something a character would actually say.

Capturing Player Speech with Deepgram

Shouldermen is an AI-powered game with voice input, which means speech recognition has to be fast and accurate. We use Deepgram's real-time transcription API to convert the player's spoken words into text that GPT-5 can process. Deepgram operates on a streaming model: audio is sent in small chunks and transcription results come back incrementally, with final results arriving as soon as the player finishes speaking.
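The key distinction on the consuming side is between interim and final results. The message shape below follows Deepgram's live transcription responses (an `is_final` flag and a transcript under `channel.alternatives`), but treat the exact field names as an assumption to verify against the current API docs:

```python
# Sketch of consuming Deepgram-style streaming results: interim
# results can drive UI feedback, but only finals reach the model.

def collect_final_transcript(messages):
    """Join the final transcript segments of one utterance."""
    parts = []
    for msg in messages:
        transcript = msg["channel"]["alternatives"][0]["transcript"]
        if msg.get("is_final") and transcript:
            parts.append(transcript)
    return " ".join(parts)
```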

Accuracy matters more here than in a typical transcription use case. If the speech recognition misinterprets a key word — say, "angel" as "angle" — the entire downstream response can veer off course. Deepgram's model handles conversational speech well, including the kinds of incomplete sentences, restarts, and informal phrasing that players naturally use when they talk to AI characters in a game. We also tuned the voice activity detection to distinguish between intentional speech and background noise, which is critical for a voice-controlled indie game where players may be in varied acoustic environments.
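To make the idea concrete, here is a deliberately simplified, energy-based stand-in for voice activity detection; the real tuning in a game like this happens through the speech API's endpointing settings rather than custom DSP like this:

```python
# Toy energy-threshold VAD over 16-bit PCM samples: frames whose mean
# absolute amplitude exceeds a tuned threshold count as speech.

def is_speech(frame, threshold=500):
    """Classify one frame of PCM samples by mean absolute amplitude."""
    if not frame:
        return False
    energy = sum(abs(sample) for sample in frame) / len(frame)
    return energy > threshold
```

The threshold is the tuning knob: too low and keyboard clicks trigger a character, too high and quiet players get ignored.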

The Game Loop: Connecting the Stack

The full pipeline works as a continuous loop. Deepgram listens and transcribes. The transcript is sent to GPT-5 with the appropriate character prompt and game state. GPT-5 streams its response. Eleven Labs synthesizes the response into audio. The audio plays through the character's avatar. The cycle then resets, ready for the player's next input.
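The loop above can be sketched with stub coroutines standing in for the real Deepgram, GPT-5, and Eleven Labs clients; every name here is hypothetical:

```python
# Skeleton of the conversational game loop: each stage runs as soon
# as its input is ready rather than waiting for the whole turn.

import asyncio

async def conversation_turn(listen, respond, synthesize, play):
    """One pass through the loop: transcribe -> generate -> speak."""
    transcript = await listen()                    # speech-to-text
    async for sentence in respond(transcript):     # streamed dialog
        play(await synthesize(sentence))           # text-to-speech

async def game_loop(turns, **stages):
    """Repeat the cycle, ready for the player's next input."""
    for _ in range(turns):
        await conversation_turn(**stages)
```

Because `respond` is an async generator, synthesis of the first sentence begins while later sentences are still being generated, which is exactly the overlap the pipeline relies on.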

Every component runs asynchronously. The game does not wait for one service to finish before engaging the next. This concurrency is what makes the experience feel conversational rather than transactional. For a deeper look at how Outdoor Devs approached the architecture, the developer page covers the broader philosophy behind our AI-first design.

Latency Challenges and Solutions

Latency is the central technical challenge for any AI-powered game with voice input. Human conversation has a natural turn-taking rhythm, and if the gap between the player finishing a sentence and the character beginning to respond exceeds roughly one to two seconds, the illusion of a real conversation breaks down.

We attack latency at every stage. Deepgram's streaming transcription eliminates the delay of waiting for a complete utterance. GPT-5's token streaming lets us begin synthesis before the full response exists. Eleven Labs' streaming synthesis lets us begin playback before the full audio is rendered. The result is a pipeline where each stage overlaps with the next, compressing total latency from five or six seconds down to roughly one. We also cache certain game-state context to reduce prompt assembly time and maintain persistent connections to each API to avoid handshake overhead on every turn.
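A back-of-envelope calculation shows why the overlap matters. The stage times and the 20% first-sentence fraction below are illustrative assumptions, not measured Shouldermen numbers:

```python
# Illustrative perceived-latency arithmetic: sequential stages vs. a
# pipeline where audio plays once the first sentence is ready.

stages = {"transcribe": 0.4, "generate": 2.5, "synthesize": 1.5}

# Sequential: the player waits for every stage to finish in full.
sequential = sum(stages.values())  # ~4.4 s

# Overlapped: playback starts once the first sentence (assume ~20%
# of the response) has been generated and synthesized.
first_chunk = 0.2
overlapped = stages["transcribe"] + first_chunk * (
    stages["generate"] + stages["synthesize"])  # ~1.2 s
```

Under these assumptions the pipeline cuts perceived latency from about 4.4 seconds to about 1.2, which matches the order-of-magnitude improvement described above.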

Making AI Characters Feel Natural

Technical performance is necessary but not sufficient. For indie games with AI characters to succeed, the characters themselves have to feel like more than language models with microphones. This is a design challenge as much as an engineering one.

We found that the most important factor is consistency. Players quickly develop a mental model of each character's personality, and any deviation — the angel suddenly being sarcastic, the devil offering sincere advice — breaks trust. Careful prompt engineering, combined with GPT-5's stronger adherence to system instructions, keeps both characters reliably in character across long play sessions.

Pacing also matters. Real people do not respond instantly. We introduce subtle, variable pauses before characters begin speaking, which paradoxically makes the interaction feel more natural than an immediate reply would. Combined with Eleven Labs' expressive synthesis, these small timing choices help players forget they are talking to software.
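The pause itself is trivial to implement; the design choice is that it is variable rather than fixed. The timing bounds here are illustrative, not the game's actual values:

```python
# Hypothetical variable pre-response pause: a short, slightly random
# delay before the character speaks, mimicking human turn-taking
# instead of machine-instant replies.

import random

def response_delay(rng=random):
    """Seconds to wait before the character starts speaking."""
    return rng.uniform(0.25, 0.7)
```

A fixed delay quickly reads as mechanical; small random variation within a narrow band is what sells the rhythm of a real conversation.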

Shouldermen represents one approach to a question the entire industry is exploring: what happens when you let players have real, unscripted conversations with game characters? The technical stack we have described here — GPT-5, Eleven Labs, and Deepgram — is the foundation, but the real work is in making all three services disappear behind characters that players actually want to talk to.