All IT projects go through phases, and generative AI is no different. Initial pilot projects test how the technology or software works and whether it delivers the results you want, or the results the vendor or developer promised. Once the pilot phase is complete, it is time to scale. You might assume that scaling up simply means deploying more resources, but in practice it surfaces new problems, and for generative AI those problems can look very different from the ones you solved during the pilot.

IT operations staff commonly describe these issues as “day one” and “day two” problems. Day one issues arise during implementation. For generative AI, they range from preparing your data for retrieval-augmented generation (RAG) to checking that your approach to chunking and indexing that data is effective. RAG pairs a pre-trained large language model (LLM) with your company’s own data, so that your generative AI application can provide more relevant, specific, and timely responses rather than relying solely on what the LLM was originally trained on.

Data, chunking, indexing, and embedding

From a day one perspective, RAG deployments can run into problems around how data is prepared. The initial phase of data preparation involves turning all of your structured and unstructured data into a single format that the generative AI system can work with. This means creating a set of document objects that represent your company data, each containing the text itself and any associated metadata.
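As a minimal sketch of what those document objects might look like in practice (the Document class and the example field values here are illustrative, not tied to any particular framework):

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """A single unit of company data, normalized to text plus metadata."""
    text: str                                      # extracted body text
    metadata: dict = field(default_factory=dict)   # source, author, timestamps, etc.

# Hypothetical examples: a support ticket and a page from a PDF manual
docs = [
    Document(text="Reset the X-200 by holding the power button for ten seconds.",
             metadata={"source": "support_ticket", "id": "T-1042"}),
    Document(text="Warranty coverage lasts 24 months. Claims require proof of purchase.",
             metadata={"source": "warranty.pdf", "page": 3}),
]
```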

The textual data is then split into smaller portions, called chunks, that can be indexed and searched. Chunk sizing can make a huge difference here. You can chunk at the sentence or paragraph level, or use more complex self-referential chunks that are processed recursively into smaller and smaller elements. Whatever approach you choose, each chunk is converted into a vector embedding and indexed so it can be retrieved in future searches.
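Building on the document objects above, here is a self-contained sketch of sentence-level chunking with a toy embedding and an in-memory index. A real deployment would use a proper embedding model and a vector database; the toy_embed function and the index list are stand-ins for those.

```python
import math
import re

def chunk_by_sentence(text: str, max_chars: int = 300) -> list[str]:
    """Greedy sentence-level chunking: pack whole sentences until max_chars is reached."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += sentence + " "
    if current.strip():
        chunks.append(current.strip())
    return chunks

def toy_embed(text: str, dims: int = 64) -> list[float]:
    """Stand-in for a real embedding model: hashes words into a normalized vector.
    Python string hashing varies between runs, so this toy index is only valid within one process."""
    vec = [0.0] * dims
    for word in re.findall(r"\w+", text.lower()):
        vec[hash(word) % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# "Index" the chunks: store each chunk alongside its embedding and source metadata.
index = []
for doc in docs:
    for chunk in chunk_by_sentence(doc.text):
        index.append({"chunk": chunk, "vector": toy_embed(chunk), "metadata": doc.metadata})
```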

When a user sends a query, that query is turned into a vector and used to search for the most relevant data, which is then passed to the LLM to consider in its response. While RAG can help reduce AI hallucinations and improve responses, it is not enough on its own. Choosing the wrong LLM, or the wrong approach to chunking or indexing, can affect how well your RAG system works and the quality of its responses. If your chunks are too big, for example, the retrieval step will return large blocks of text with little relevance to the specific request, diluting the context the LLM has to work with.
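Continuing the toy sketch, the retrieval step looks roughly like this. In production the similarity search would be handled by your vector database, and call_llm below is a placeholder for whichever LLM client you use, not a real API:

```python
def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; the toy vectors are already normalized, so this is a dot product."""
    return sum(x * y for x, y in zip(a, b))

def retrieve(query: str, top_k: int = 3) -> list[str]:
    """Embed the query and return the most similar chunks from the index."""
    qvec = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(qvec, item["vector"]), reverse=True)
    return [item["chunk"] for item in ranked[:top_k]]

def answer(query: str) -> str:
    """Build a prompt from the retrieved context and hand it to the LLM."""
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)  # placeholder: swap in your LLM client of choice
```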

Scaling generative AI applications

Once you get your RAG system working effectively, you may find that you face new challenges. Consider OpenAI’s ChatGPT: once the service was made available for anyone to ask questions, it grew extremely rapidly, amassing 100 million users in only two months, according to Business of Apps. Companies adopting generative AI and RAG hope to see similarly high volumes of user requests coming through.

However, have you considered what might happen if your app is wildly successful, and how much pressure that would put on your generative AI infrastructure? Will your generative AI costs scale alongside the revenue you aim to generate from it, or are you treating it as additional revenue on top of your existing products? Where do you predict you will break even, and eventually reach your desired margin, at different levels of uptake?

For many services, the work that goes into running at scale is just as important as the initial design and build work that delivers quality responses. This is an example of a day two problem, where what worked in the testing phase does not scale to meet demand. How does your RAG deployment cope with thousands or millions of simultaneous user requests, and how quickly can your vector database and LLM integration components process that data so the system can return a response to the user?

Users might accept slow performance from a free novelty service, but they are far less willing to put up with poor response times when they are paying. In analyses of RAG requests, up to 40 percent of transaction latency can come from calls to the embedding service and vector search service, where RAG matches the right data and returns it to the user. Optimizing that round trip can therefore have a huge impact on the user experience, for example by caching responses to previous similar requests. At the same time, each round trip represents a cost in compute resources, particularly in the cloud. Reducing workloads so that companies pay only for what they use helps cut those costs and makes any spend more efficient.
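As a hedged sketch of that caching idea, continuing the toy example: store the response alongside each query embedding, and serve the cached answer when a new query is similar enough. The threshold and cache structure here are illustrative choices to tune against your own traffic, not a prescription.

```python
CACHE: list[dict] = []        # each entry holds a query vector and its response
SIMILARITY_THRESHOLD = 0.95   # illustrative; tune against real traffic and quality needs

def cached_answer(query: str) -> str:
    """Serve a cached response for sufficiently similar queries; otherwise do the full round trip."""
    qvec = toy_embed(query)
    for entry in CACHE:
        if cosine(qvec, entry["vector"]) >= SIMILARITY_THRESHOLD:
            return entry["response"]   # cache hit: skip the embedding, search, and LLM calls
    response = answer(query)           # cache miss: full retrieval plus LLM call
    CACHE.append({"vector": qvec, "response": response})
    return response
```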

Alongside this, developers and IT operations staff will have to look at where they run generative AI workloads. Many companies will start in the cloud, wanting to avoid the burden of running their own LLMs, while others will want to run models themselves to keep control over their choices and avoid lock-in. Either way, whether you run on-premises or in the cloud, you will have to think about running across multiple locations.

Using multiple sites provides resiliency for a service; if one site becomes unavailable, the service can still function. For on-premises deployments, this can mean implementing failover and availability technologies around your vector data sets, so that the data can be queried whenever it is needed. For cloud deployments, running in multiple locations is simpler, as you can use different cloud regions to host and replicate vector data. Multiple sites also let you serve responses from the site closest to each user, reducing latency, and make it easier to meet data residency requirements if data must be kept in a specific region for compliance purposes.
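A rough illustration of the routing side of that idea, with made-up region names, endpoints, and latency figures: prefer the lowest-latency healthy region, but pin requests to a specific region when residency rules require it.

```python
# Hypothetical region endpoints with measured latencies and health status.
REGIONS = [
    {"name": "eu-west", "endpoint": "https://eu.rag.internal", "latency_ms": 24, "healthy": True},
    {"name": "us-east", "endpoint": "https://us.rag.internal", "latency_ms": 95, "healthy": True},
    {"name": "ap-south", "endpoint": "https://ap.rag.internal", "latency_ms": 180, "healthy": False},
]

def pick_region(residency: str | None = None) -> dict:
    """Choose the lowest-latency healthy region, optionally pinned for data residency."""
    candidates = [r for r in REGIONS if r["healthy"]]
    if residency:
        candidates = [r for r in candidates if r["name"] == residency]
    if not candidates:
        raise RuntimeError("No healthy region available for this request")
    return min(candidates, key=lambda r: r["latency_ms"])
```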

Ongoing operational overhead

Day two IT operations involve looking at the overhead and problems around running your infrastructure, and then removing bottlenecks or optimizing your approach to solve them. Because generative AI applications involve huge volumes of data and many integrated components and services, it is important to consider the operational overhead that will accumulate over time. As generative AI services become more popular, issues may arise around how those integrations work at scale. If you want to add more functionality or integrate more AI agents, those integrations will need enterprise-grade support.

Choosing your components and integrating them yourself lets you take a best-of-breed approach to your application, and a microservices architecture can make it easier to support more integrations or functionality over time. However, DIY also means you are responsible for all of that integration and management, and that work adds up. The alternative is a stack-based approach, where support for different tools and integrations has already been implemented for you. A pre-built stack lets your team focus on building applications rather than implementing infrastructure, and should simplify your operations as well.

Once you have your generative AI application up and running, you will have to operate and support the service so that it meets user expectations around performance and quality. As your deployment matures, you will discover new problems that are specific to RAG alongside traditional IT management issues like availability and cost. As you scale up, your focus will have to shift from day one problems to day two challenges. Taking a stack-based approach can help here, letting you concentrate on delivering the best possible service to your users.

Dom Couldwell is head of field engineering EMEA at DataStax.

Generative AI Insights provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss the challenges and opportunities of generative artificial intelligence. The selection is wide-ranging, from technology deep dives to case studies to expert opinion, but also subjective, based on our judgment of which topics and treatments will best serve InfoWorld’s technically sophisticated audience. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Contact doug_dineley@foundryco.com.