At Campfire, we launched a consumer AI chat product called Cozy Friends on Steam and mobile on October 12. Imagine Animal Crossing or The Sims with AI agents. In the first 30 days, users exchanged more than 1.7M messages with our AI agents. This beat our best expectations and validated that we’d finally gotten good at conversational AI products. 

It took us a year of building to get there. We had to build a ton of tools, some of them twice over, and bang our heads against various walls for months to finally make a decent AI chat product. 

I want to build better AI chat products faster, and help others avoid our painful experience. To that end, I'm open sourcing all of my learnings and launching our internal AI tools as a product: Sprites, our all-in-one tool for building, optimizing, and scaling conversational AI agents.

Below I offer seven of the biggest lessons we learned while developing our AI chat product. Most importantly, you should think of your AI chat outputs as a complex function, not as a wrapper around a single large language model (LLM). With that framing, here are my seven tips:

System prompts are a function of user and application state

Your system prompts need to be built and managed like a React app, evolving with user intent and data, rather than like a static HTML web page.

You can think of your system prompt as a function of the application state: it needs to be dynamic and evolve as the user journey progresses. It's not even a piecewise function made up of two or three static prompts. You need to modify or entirely replace the prompt based on the evolution of the conversation, metadata from chain-of-thought workflows, summarization, personal data from the user, and so on. You want to include or omit parts of it at any given user state for a better outcome. This blog post on prompt design from Character.ai is a great resource.

In short, think of prompts as a dynamic set of instructions that must be maintained to control your user experience, more like the UI elements visible on a given screen of your app than a one-time set of instructions locked in at the start of the user journey.
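For illustration, here is a minimal sketch of that idea in Python. The UserState fields, the Ember persona, and the build_system_prompt helper are hypothetical, not from our actual codebase; the point is that the prompt gets reassembled from state on every turn rather than stored as a constant.

```python
from dataclasses import dataclass, field

@dataclass
class UserState:
    """Hypothetical application state; field names are illustrative."""
    day: int = 0                 # days since the user's first session
    turn_count: int = 0          # messages exchanged in the current thread
    summary: str = ""            # rolling summary of earlier conversation
    memories: list[str] = field(default_factory=list)
    display_name: str = "friend"

BASE_PERSONA = "You are Ember, a warm, playful companion in a cozy village."

def build_system_prompt(state: UserState) -> str:
    """Rebuild the system prompt from the current user and app state on every turn."""
    parts = [BASE_PERSONA]

    if state.day == 0:
        # Early-journey users get tighter, more deterministic instructions.
        parts.append("Keep replies short and ask one onboarding question at a time.")
    else:
        parts.append(f"Address the user as {state.display_name}.")

    if state.summary:
        parts.append(f"Summary of the conversation so far: {state.summary}")

    if state.memories:
        parts.append("Relevant memories:\n- " + "\n- ".join(state.memories[-5:]))

    if state.turn_count > 50:
        # Long threads drift; re-anchor the persona explicitly.
        parts.append("Stay in character and avoid repeating earlier phrasing.")

    return "\n\n".join(parts)
```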

Opt for deterministic outcomes, especially in the early user journey

With most online products, you finely control your user’s “day zero” experience with an intricately built onboarding flow, then you unleash them onto a magical blank canvas to do whatever they want. With an AI chat product, you probably want to keep the same philosophy and build deterministic chat outcomes for your users, especially in their first few days. But then what?

Should the AI bring up a certain topic or suggestion within the first five messages, or be nudged toward a certain action on the user's second day? Should the AI change the topic at certain times to keep the user engaged? Is there a conversational ramp for the activation moment? Do you want to extract some info from your user during onboarding, in chat format, to personalize the experience?

The answer to all of the above is most likely yes if you’re building a consumer product. 
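One simple way to get there is a thin layer of rules that injects scripted directives during the first few turns and days, then steps aside. The sketch below is hypothetical: the beats, the day/turn thresholds, and the deterministic_directive helper are illustrative, and the returned directive would be appended to the dynamically built system prompt described above.

```python
from typing import Optional

# Illustrative scripted "beats" for a user's first session; content is hypothetical.
ONBOARDING_BEATS = {
    1: "Ask the user what they like to be called.",
    3: "Suggest decorating the campsite together.",
    5: "Invite the user to meet a second character tomorrow.",
}

def deterministic_directive(day: int, turn: int) -> Optional[str]:
    """Return a scripted directive for early-journey turns, or None to let the model roam."""
    if day == 0 and turn in ONBOARDING_BEATS:
        return ONBOARDING_BEATS[turn]
    if day == 1 and turn == 1:
        return "Follow up on yesterday's conversation before introducing anything new."
    return None  # past the onboarding window: free-form generation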

Use model blending

Results improve when you route messages in the same thread to two to six models with orthogonal capabilities instead of always going to the same model. Say you have model A, which is good at prose and role playing, and model B, which is good at reasoning. If you simply alternate messages between A and B, the outcome over a multi-turn conversation ends up being dramatically better.
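A minimal sketch of that routing, assuming a hypothetical two-model pool (the model names are placeholders): a simple round-robin over the thread's turn index is enough to get the blending effect.

```python
# Hypothetical model identifiers; a real blend might rotate through up to six models.
MODEL_POOL = [
    "model-a-roleplay",   # strong at prose and role play
    "model-b-reasoner",   # strong at reasoning
]

def pick_model(turn_index: int) -> str:
    """Round-robin the thread across the pool so no single model shapes the whole conversation."""
    return MODEL_POOL[turn_index % len(MODEL_POOL)]
```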

Aside from split-testing advanced prompts, this is the easiest win with the biggest impact. But choose the models wisely.

Use scripted responses

As amazing as LLMs are, they're better deployed for chat in a controlled manner than as a magical talking box. You can use a smaller model to infer some semantics about the user input and route to a pre-written response much of the time. This will save you a ton of money while actually delivering a better user experience.

If you can build a simple decision tree with some semantic reasoning for routing to serve a common user journey, you'll probably end up with a better product than if every single response came from a live inference call.
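Here is a rough sketch of that routing layer. The intents, the canned lines, and the keyword-based classify_intent are placeholders; in practice you'd swap the keyword rules for a small, cheap classification model and only call generate_fn (your full LLM inference) when nothing scripted fits.

```python
import random
from typing import Callable

# Hypothetical intents and canned lines; a small, cheap model (or keyword rules) does the routing.
SCRIPTED_RESPONSES = {
    "greeting": ["Hey! Good to see you again.", "Welcome back!"],
    "goodbye": ["See you soon!", "Take care out there."],
    "thanks": ["Anytime!", "Happy to help."],
}

def classify_intent(user_message: str) -> str:
    """Stand-in for a small classification model; keyword rules keep the sketch self-contained."""
    text = user_message.lower()
    if any(w in text for w in ("hi", "hello", "hey")):
        return "greeting"
    if any(w in text for w in ("bye", "goodnight")):
        return "goodbye"
    if "thank" in text:
        return "thanks"
    return "other"

def respond(user_message: str, generate_fn: Callable[[str], str]) -> str:
    """Serve common intents from pre-written lines; fall back to full LLM inference otherwise."""
    intent = classify_intent(user_message)
    if intent in SCRIPTED_RESPONSES:
        return random.choice(SCRIPTED_RESPONSES[intent])
    return generate_fn(user_message)  # the expensive call happens only when nothing scripted fits
```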

Craft amazing conversation starters

We built an entirely separate inference system from our core dialog system that uses summaries of previous chats, stored memories, the user's recent in-app actions, and some random seeds so the AI characters can initiate good conversations. If you don't do this, your AI will produce some version of "Hi! How can I assist you today?" more often than you'd like.
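As a rough illustration, the opener prompt can be assembled from exactly those ingredients. Everything below is hypothetical (the seed list, field names, and build_starter_prompt helper); the result would go to a separate inference call, not to the core dialog system.

```python
import random

# Hypothetical seeds that keep openers varied from day to day.
STARTER_SEEDS = [
    "mention something that happened in the village today",
    "ask a playful question about the user's last visit",
    "share a small secret or rumor",
]

def build_starter_prompt(chat_summary: str, memories: list[str], recent_actions: list[str]) -> str:
    """Assemble the prompt for a separate 'opener' inference, distinct from the core dialog system."""
    seed = random.choice(STARTER_SEEDS)
    return (
        "Write the first message of a new conversation, in character.\n"
        f"Conversation so far (summarized): {chat_summary}\n"
        f"Things you remember about the user: {'; '.join(memories[-3:])}\n"
        f"What the user did recently in the app: {'; '.join(recent_actions[-3:])}\n"
        f"Angle for this opener: {seed}\n"
        "Never open with a generic assistant greeting."
    )
```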

The quality of AI-to-AI chats degrades quickly

During user testing, we repeatedly saw the blank canvas problem: users didn't know what to type. We added a "magic wand" that offers three AI-generated messages in the user's voice. While it removed a short-term point of friction, users who relied on the magic wand churned much faster. When we studied the chat logs, we found that AI-to-AI chat degrades into a loop of nonsense within a few turns.

Have a clear metric to judge AI output

If you just prompt and test your chatbot yourself for a few messages, and call it good enough… trust me, it won’t be good enough. Your AI outputs need to maintain quality after a 100-turn conversation, across several sessions, and for different user personas. 

You need to try many different variants and build a clear feedback loop, using something like a Likert scale or a simple Elo rating, to compare variants and see what your users find engaging or useful in chat.
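If you go the Elo route, the bookkeeping is tiny. Here's a minimal sketch, assuming you collect pairwise preferences between two variants (the update_elo helper and starting ratings are illustrative):

```python
# Minimal Elo bookkeeping for pairwise comparisons between two chat variants.
def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update: compute the expected score, then shift ratings by K * (actual - expected)."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: both variants start at 1000; feed in each user preference as it arrives.
a, b = 1000.0, 1000.0
for a_preferred in (True, True, False, True):
    a, b = update_elo(a, b, a_preferred)
```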

We found that using another inference with a general-purpose LLM to judge the output (e.g., a prompt like "On a scale of 1 to 5, how entertaining is this conversation?" running with GPT-4o as the judge) produced poor results that were out of sync with users' feedback.

All in all, the days of vibing your way to a system prompt and calling it a day are long gone. As obvious as it may sound, if your product is AI, then the AI better be great. This will be the number one factor determining your success. The AI novelty era is over. You will need a clear framework and lots of experimentation to delight your users and deliver them value. Good luck!

Siamak Freydoonnejad is co-founder of Campfire.

Generative AI Insights provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss the challenges and opportunities of generative artificial intelligence. The selection is wide-ranging, from technology deep dives to case studies to expert opinion, but also subjective, based on our judgment of which topics and treatments will best serve InfoWorld’s technically sophisticated audience. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Contact doug_dineley@foundryco.com.