Every game starts with a question. For Shouldermen, it was this: what if your voice was the only controller you needed? Not button prompts mapped to preset dialogue trees, but actual spoken conversation with characters who listen, think, and respond in real time. That idea became the foundation of the Shouldermen Steam game, and turning it into reality was one of the most challenging and rewarding things I have ever built.
The Angel and Devil on Your Shoulder
The inspiration came from a place most people can relate to: that internal tug-of-war between doing the right thing and giving in to temptation. We all carry a figurative angel on one shoulder and a devil on the other. I wanted to make that metaphor literal and interactive. In Shouldermen, you are trapped in a VR prison with two AI companions—an angel and a devil—who each have their own agenda, their own personality, and their own idea of how you should escape. The catch is that you talk to them using your actual voice, and they talk back. There are no dialogue wheels, no multiple-choice menus. You simply speak, and the game listens.
That concept sounded elegant on paper. Executing it was another matter entirely. If you want to learn more about the game's story and mechanics, head over to our About the Game page for a deeper look.
Building Conversational AI Agents with GPT-5
At the core of Shouldermen is a system for building conversational AI agents that feel like genuine characters rather than chatbots. Early prototypes used simpler language models, and the results were unconvincing. Characters would lose track of context, repeat themselves, or generate responses that broke the fiction. Everything changed when we integrated GPT-5.
As one of the first games to use GPT-5, Shouldermen leans on the model's ability to maintain long-context conversations, stay in character, and reason about puzzle states. Each character—the angel and the devil—operates under a detailed system prompt that defines its personality, knowledge of the game world, and relationship to the player. GPT-5 handles the rest, generating lines of dialogue that feel authored rather than assembled. The result is a genuine AI-generated dialogue game where no two playthroughs sound the same.
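To make the idea concrete, here is a minimal sketch of how per-character prompt assembly might look. The persona text, state format, and function names are hypothetical illustrations, not the game's actual prompts; the only thing taken from the article is the structure: one detailed system prompt per character, plus the running conversation and the current puzzle state.

```python
# Hypothetical sketch of per-character prompt assembly for a chat-style
# LLM API. Persona text and state format are invented for illustration.

ANGEL_PERSONA = (
    "You are the angel on the player's shoulder: warm, measured, and "
    "reassuring. You want the player to escape the right way."
)

def build_messages(persona: str, puzzle_state: dict, history: list[dict]) -> list[dict]:
    """Assemble the message list sent to the model for one character."""
    system = (
        f"{persona}\n\n"
        f"Current puzzle state: {puzzle_state}\n"
        "Stay in character at all times. Never mention that you are an AI model."
    )
    # The system prompt anchors the persona; the history carries the
    # spoken conversation transcribed from the player's microphone.
    return [{"role": "system", "content": system}, *history]
```

In practice each character keeps its own history, so the angel and the devil can hold contradictory views of the same conversation.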
The biggest technical challenge was latency. Players will tolerate a brief pause when a character is "thinking," but anything over a couple of seconds breaks immersion. We spent weeks optimizing our prompt architecture, streaming partial responses, and caching context so the model could reply quickly without sacrificing coherence. Making this AI-powered game with voice input feel responsive meant treating every millisecond as a design decision.
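One common way to exploit streamed partial responses, sketched here as an assumption rather than a description of Shouldermen's exact pipeline, is to flush text to the voice-synthesis stage at sentence boundaries while the model is still generating, so audio playback can begin before the full reply exists:

```python
import re

# Split points: whitespace that follows sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def stream_sentences(token_stream):
    """Yield complete sentences as soon as they finish, while tokens are
    still arriving, so TTS can start on sentence one immediately."""
    buffer = ""
    for token in token_stream:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last fragment is a finished sentence.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer
```

With this shape, perceived latency is the time to the first sentence, not the time to the full response.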
Giving Characters a Voice with Eleven Labs
Generating smart dialogue is only half the equation. The words also need to sound right. We chose Eleven Labs for voice synthesis because their models produce natural, expressive speech with low latency—exactly what a dialogue-driven game on Steam requires. The angel's voice is warm, measured, and reassuring. The devil's is sharp, playful, and laced with sarcasm. Both voices are synthesized on the fly from the text GPT-5 generates, which means the audio you hear in-game has never existed before that exact moment.
Tuning the voices took more iteration than I expected. Small adjustments to pitch, pacing, and emphasis made the difference between a character who sounded alive and one who sounded like a GPS navigator reading poetry. We also had to handle edge cases: what happens when the model outputs an unusually long sentence, or an exclamation followed by a whisper? Eleven Labs gave us enough control to smooth those transitions without adding perceptible delay.
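For the long-sentence edge case, one workable approach (an illustrative sketch, not necessarily the game's exact solution) is to pre-split overlong model output at clause boundaries before sending it to synthesis, so the TTS engine receives manageable chunks with natural pause points:

```python
import re

def split_for_tts(sentence: str, max_chars: int = 200) -> list[str]:
    """Break an overlong sentence at clause boundaries (commas,
    semicolons, colons) so each synthesis request stays short and the
    pauses between chunks land where a speaker would breathe."""
    if len(sentence) <= max_chars:
        return [sentence]
    chunks, current = [], ""
    for clause in re.split(r"(?<=[,;:])\s+", sentence):
        if current and len(current) + len(clause) + 1 > max_chars:
            chunks.append(current)
            current = clause
        else:
            current = f"{current} {clause}".strip()
    if current:
        chunks.append(current)
    return chunks
```

The `max_chars` value here is an invented placeholder; in a real pipeline it would be tuned against the synthesis engine's latency curve.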
Making Voice Recognition Work as a Game Controller
A voice-controlled video game lives or dies by how well it understands the player. We chose Deepgram for speech-to-text because it offers high accuracy and real-time streaming transcription, two non-negotiable requirements for a game where talking is the primary input. Deepgram listens to the player's microphone feed, converts speech to text, and passes it to the GPT-5 layer in a matter of milliseconds.
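Streaming recognizers like Deepgram emit fast interim hypotheses that are later superseded by a final result for the same stretch of audio. A sketch of how a game client might track that stream (the class and field names are invented for illustration) is to show interims to the player while forwarding only finalized text to the dialogue layer:

```python
class TranscriptAccumulator:
    """Track a live transcript from a streaming speech-to-text feed.

    Interim results are provisional and get replaced; only results
    flagged as final should ever reach the LLM.
    """

    def __init__(self):
        self.finals: list[str] = []  # confirmed segments, in order
        self.interim = ""            # current provisional hypothesis

    def on_result(self, text: str, is_final: bool) -> None:
        if is_final:
            self.finals.append(text)
            self.interim = ""  # the final supersedes the hypothesis
        else:
            self.interim = text

    @property
    def display(self) -> str:
        # What the player might see: confirmed text plus the live guess.
        return " ".join([*self.finals, self.interim]).strip()

    def utterance(self) -> str:
        # What gets handed to the dialogue layer when the turn ends.
        return " ".join(self.finals)
```

Separating display text from dispatched text is what lets captions feel instant while keeping garbled half-sentences away from the characters.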
The hardest part was not accuracy—Deepgram handles diverse accents and speaking styles well—but rather managing the flow of conversation. When does the game decide the player has finished speaking? How does it handle interruptions, crosstalk, or someone laughing mid-sentence? We built a custom turn-taking system on top of Deepgram's voice activity detection so that conversations feel natural rather than robotic. Silence detection thresholds, barge-in logic, and graceful error recovery all had to be tuned by hand through hundreds of hours of playtesting.
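The turn-taking logic described above can be sketched as a small state machine over per-frame voice-activity flags. The thresholds below are illustrative placeholders, not the hand-tuned values from playtesting, and the frame size (20 ms) is an assumption:

```python
class TurnTaker:
    """Minimal endpointing sketch: decide when the player's turn ends,
    and when the player is barging in over a talking character, from
    per-frame voice-activity-detection flags (assume ~20 ms frames)."""

    def __init__(self, silence_frames_to_end: int = 40,   # ~800 ms of quiet
                 speech_frames_to_barge_in: int = 10):    # ~200 ms of overlap
        self.silence_to_end = silence_frames_to_end
        self.speech_to_barge = speech_frames_to_barge_in
        self.silence_run = 0
        self.speech_run = 0
        self.player_speaking = False

    def on_frame(self, voice_active: bool, npc_talking: bool):
        """Returns 'barge_in', 'end_of_turn', or None for this frame."""
        if voice_active:
            self.silence_run = 0
            self.speech_run += 1
            if (npc_talking and not self.player_speaking
                    and self.speech_run >= self.speech_to_barge):
                # Player talked over the character long enough: interrupt.
                self.player_speaking = True
                return "barge_in"
            if not self.player_speaking and self.speech_run >= 2:
                self.player_speaking = True  # ignore one-frame blips
        else:
            self.speech_run = 0
            if self.player_speaking:
                self.silence_run += 1
                if self.silence_run >= self.silence_to_end:
                    self.player_speaking = False
                    self.silence_run = 0
                    return "end_of_turn"
        return None
```

Requiring a short run of voiced frames before declaring speech is what filters out coughs and laughter; the silence threshold is the knob that trades snappiness against cutting players off mid-thought.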
The Early Access Journey
Shouldermen launched into Early Access because a game built on conversation needs real conversations to improve. Every player who speaks to the angel or the devil teaches us something: which prompts confuse the AI, where latency spikes, what kinds of questions players ask that we never anticipated. Early Access is not just a release strategy for us; it is the development process itself.
Feedback from the community has already shaped major updates, from expanded puzzle scenarios to improved voice recognition in noisy environments. If you are curious about the studio behind the game and our development philosophy, visit the Developer page.
What Comes Next
Shouldermen is still growing. We are adding new chapters, refining AI behavior, and exploring multiplayer scenarios where multiple players argue with the angel and devil simultaneously. The dream has always been to prove that voice-driven, AI-powered gameplay is not a gimmick—it is the next frontier of interactive storytelling.
If you want to experience it for yourself, try Shouldermen on Steam and let us know what you think. Your voice is the controller. Use it.