Large language models by themselves are less than meets the eye; the moniker “stochastic parrots” isn’t wrong. Connect LLMs to specific data for retrieval-augmented generation (RAG) and you get a more reliable system, one that is much less likely to go off the rails and “hallucinate,” which is a relatively nice way of describing LLMs that bullshit you. Connect RAG systems to software that can take actions, even indirect actions like sending emails, and you may have something useful: Agents. These connections, however, don’t spring into being fully grown from their father’s forehead. You need a framework that ties the components together and orchestrates them.
What are LLM application frameworks?
LLM application frameworks are basically plumbing, or, if you like fancier and more specific words, orchestration providers. In a RAG application, for example, LLM application frameworks connect data sources to vector databases via encoders, enhance user queries with the results of vector database lookups, pass the enhanced queries to the LLMs along with generic system instructions, and pass the models’ output back to the user. Haystack, for example, talks about using components and pipelines to help you assemble LLM applications.
LLM application frameworks help by reducing the amount of code you need to write to create an application. The fact that these application frameworks have been designed and coded by experts, tested by thousands of programmers and users, and used in production should give you some confidence that your “plumbing” will perform correctly.
Use cases
Use cases for LLM application frameworks include RAG, chatbots, agents, generative multi-modal question answering, information extraction from documents, and many more. While these use cases are all related by the incorporation of LLMs and (usually) vector search, they have somewhat different purposes.
RAG is a way to expand the knowledge of an LLM without retraining or fine-tuning the LLM. This helps to ground the model and can also help to focus it on specific information. RAG’s three steps are retrieval from a specified source, augmentation of the prompt with the context retrieved from the source, and then generation using the model and the augmented prompt.
The information source can be documents, the web, and/or databases. You could give this additional information to the model as part of the query, as long as you didn’t exceed the model’s context window. Even with a huge context window, though, you could run into a “needle in a haystack” problem searching large source documents, meaning that some models might miss specific relevant facts if they are surrounded by too much irrelevant material.
Instead, you could encode your text and media information as high-dimensional floating-point vectors using an embedding model such as Word2vec (text only) or DeViSE (mixed text and media) and store them in a vector database such as Qdrant or Elasticsearch. You could then use the same embedding model to encode your search term and run a vector search for the K nearest items, measured by a distance metric such as cosine or Euclidean distance. Finally, you would augment the query with the selected source information and send it to your LLM. In most cases, the results from the retrieval-augmented generation will be grounded in the information you have provided.
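To make those steps concrete, here is a minimal sketch in Python. It swaps in a sentence-transformers model (rather than Word2vec or DeViSE) for the embeddings, does a brute-force cosine-similarity search with NumPy instead of a real vector database, and leaves the final LLM call as a placeholder:

```python
# Minimal RAG sketch: embed documents, retrieve by cosine similarity,
# then augment the prompt before calling an LLM. The embedding model and
# document snippets are illustrative; the LLM call itself is a placeholder.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Haystack is an open-source LLM framework sponsored by deepset.",
    "LangChain supports Python and TypeScript/JavaScript.",
    "Semantic Kernel has C#, Python, and Java implementations.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents closest to the query by cosine similarity."""
    query_vector = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ query_vector  # cosine similarity (vectors are normalized)
    top_k = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top_k]

query = "Which languages does LangChain support?"
context = "\n".join(retrieve(query))
augmented_prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# augmented_prompt now goes to the LLM of your choice, e.g. an OpenAI or Cohere client.
print(augmented_prompt)
```

Once the document collection grows beyond a toy example, a vector database such as Qdrant or Elasticsearch takes over the job of the brute-force NumPy search.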
Chatbots are designed to mimic human conversation; they go back at least to Joseph Weizenbaum’s ELIZA program, published in 1966. Modern chatbots expand on simple LLM or RAG queries by using some kind of memory to keep track of the conversation, and using previous queries and replies to enhance the context of each new query.
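The “memory” can be as simple as replaying the running message list on every turn. A minimal sketch, assuming the OpenAI Python client and a placeholder model name; any chat-completion API follows the same pattern:

```python
# Minimal chatbot memory: keep the running conversation and resend it each turn.
# Assumes the OpenAI Python client; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
history = [{"role": "system", "content": "You are a concise assistant."}]

def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    reply = client.chat.completions.create(model="gpt-4o-mini", messages=history)
    answer = reply.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer

print(chat("Who wrote ELIZA?"))
print(chat("When was it published?"))  # answered from the conversation memory
```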
Agents are LLM applications that call other software to perform actions. Microsoft calls them Copilots. Some frameworks differentiate agents from chains, the distinction being that agents use a language model as a reasoning engine to determine which actions to take and in which order, while chains hard-code sequences.
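The difference is easiest to see side by side. In the toy Python below, the chain hard-codes its steps, while the agent loop lets a model-driven chooser decide the next action; every helper here is a made-up stand-in, not any framework’s API:

```python
# Toy contrast between a chain and an agent. All helpers are illustrative stubs;
# in a real framework the LLM itself would choose the agent's next action.

def search_web(query):                  # stand-in for a tool
    return [f"search results for {query!r}"]

def llm_answer(question, context):      # stand-in for an LLM call
    return f"Answer to {question!r} using: {context}"

# Chain: the programmer fixes the sequence of steps in code.
def chain(question):
    docs = search_web(question)
    return llm_answer(question, " ".join(docs))

# Agent: a reasoning step picks the next tool until it decides to finish.
def agent(question, tools, choose_action, max_steps=5):
    scratchpad = []
    for _ in range(max_steps):
        action, argument = choose_action(question, scratchpad, tools)
        if action == "finish":
            return argument
        scratchpad.append((action, tools[action](argument)))
    return "No answer within the step budget."

print(chain("What is RAG?"))
```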
Some models with the ability to take audio, images, and/or video as input can be used in applications that implement generative multi-modal question answering. For example, in my review of Google Vertex AI Studio, I demonstrated how the Gemini Pro Vision model can infer the price of a fruit in one image by identifying the fruit and reading its price in another image.
Information extraction from documents can be more complicated than you might think. For example, if 20 documents in different formats are scanned to provide input to a mortgage loan application processor, the application needs to recognize forms, OCR the numbers and labels, and pull out the relevant tagged values to populate the summary form for the human loan officer.
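A common pattern is to OCR each scanned page and then ask an LLM to return the tagged values as structured JSON. A rough sketch, assuming pytesseract for the OCR step and the OpenAI client for extraction; the file name, field list, and model name are invented for illustration:

```python
# Rough document-extraction sketch: OCR a scanned form, then ask an LLM to
# return the requested fields as JSON. Libraries and field names are illustrative.
import json
from PIL import Image
import pytesseract
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_fields(image_path: str, fields: list[str]) -> dict:
    text = pytesseract.image_to_string(Image.open(image_path))  # OCR the scan
    prompt = (
        f"Extract these fields from the form text and reply with JSON only: {fields}\n\n"
        f"Form text:\n{text}"
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # Production code would validate and repair the JSON before trusting it.
    return json.loads(reply.choices[0].message.content)

print(extract_fields("pay_stub_scan.png", ["employer", "gross_pay", "pay_period"]))
```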
Programming languages
Programming languages supported by LLM application frameworks include Python, C#, Java, TypeScript, and JavaScript. LangChain and LlamaIndex have implementations in Python and TypeScript/JavaScript. Semantic Kernel has implementations in C#, Python, and Java, but not all Semantic Kernel features are supported in all of these languages. Haystack is implemented exclusively in Python.
Haystack
Haystack is billed as an open-source framework for building LLM applications, RAG applications, and search systems for large document collections. Haystack is also the foundation for deepset Cloud. deepset is the primary sponsor of Haystack, and several deepset employees are heavy contributors to the Haystack project.
Integrations with Haystack include models hosted on platforms such as Hugging Face, OpenAI, and Cohere; models deployed on platforms such as Amazon SageMaker, Microsoft Azure AI, and Google Cloud Vertex AI; and document stores such as OpenSearch, Pinecone, and Qdrant. In addition, the Haystack community has contributed integrations for tooling such as evaluation, monitoring, and data ingestion.
Use cases for Haystack include RAG, chatbots, agents, generative multi-modal question answering, and information extraction from documents. Haystack provides functionality for the full scope of LLM projects, such as data source integration, data cleaning and preprocessing, models, logging, and instrumentation.
Haystack components and pipelines help you to assemble applications easily. While Haystack has many pre-built components, adding a custom component is as simple as writing a Python class. Pipelines connect components into graphs or multigraphs (the graphs don’t need to be acyclic), and Haystack offers many example pipelines for common use cases. deepset Studio is a newer product that lets AI developers design and visualize custom AI pipelines.
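To give a flavor of those components and pipelines, here is roughly what a small Haystack 2.x RAG pipeline looks like, using the in-memory document store and BM25 retriever from Haystack’s own tutorials; treat it as a sketch, and note that it assumes an OpenAI API key in the environment:

```python
# Sketch of a small Haystack 2.x RAG pipeline: retriever -> prompt builder -> LLM.
# Uses Haystack's in-memory components; an OpenAI API key is assumed.
from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator

store = InMemoryDocumentStore()
store.write_documents([Document(content="deepset is the primary sponsor of Haystack.")])

template = """Answer from the documents below.
{% for doc in documents %}{{ doc.content }}{% endfor %}
Question: {{ query }}"""

pipe = Pipeline()
pipe.add_component("retriever", InMemoryBM25Retriever(document_store=store))
pipe.add_component("prompt_builder", PromptBuilder(template=template))
pipe.add_component("llm", OpenAIGenerator(model="gpt-4o-mini"))
pipe.connect("retriever", "prompt_builder.documents")
pipe.connect("prompt_builder", "llm")

question = "Who sponsors Haystack?"
result = pipe.run({"retriever": {"query": question}, "prompt_builder": {"query": question}})
print(result["llm"]["replies"][0])
```

Swapping the in-memory pieces for OpenSearch, Pinecone, or Qdrant is mostly a matter of replacing the document store and retriever components.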
For a deeper look at Haystack, see my review.
LangChain
LangChain enables language models to connect to sources of data, and also to interact with their environments. LangChain components are modular abstractions and collections of implementations of the abstractions. LangChain off-the-shelf chains are structured assemblies of components for accomplishing specific higher-level tasks. You can use components to customize existing chains and to build new chains.
Note that there are two kinds of language models in LangChain, LLMs and ChatModels. LLMs take a string as input and return a string. ChatModels take a list of messages as input and return a ChatMessage. ChatMessages contain two components, the content and a role. Roles specify where the content came from: a human, an AI, the system, a function call, or a generic input.
In general, LLMs use prompt templates for their input. That allows you to specify the role that you want the LLM or ChatModel to take, for example “a helpful assistant that translates English to French.” It also allows you to apply the template to many instances of content, such as a list of phrases that you want translated.
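In current LangChain, that translation example looks roughly like this; the pipe-style chain syntax (LCEL), the langchain-openai package, and the model name are the assumptions here:

```python
# Minimal LangChain sketch: a chat prompt template with system and human roles,
# piped into a chat model. Assumes the langchain-openai package and an OpenAI key.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant that translates English to French."),
    ("human", "{text}"),
])

# The same template can be reused for any number of inputs.
chain = prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()
print(chain.invoke({"text": "I love programming."}))
```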
LangChain has six modules:
- Model I/O is an interface with language models.
- Data connection is an interface with application-specific data.
- Chains construct sequences of calls.
- Agents let chains choose which tools to use given high-level directives.
- Memory persists application state between runs of a chain.
- Callbacks log and stream intermediate steps of any chain.
For a more thorough overview of LangChain, see my explainer.
LlamaIndex
At a high level, LlamaIndex is designed to help you build context-augmented LLM applications, which basically means that you combine your own data with a large language model. Examples of context-augmented LLM applications include question-answering chatbots, document understanding and extraction, and autonomous agents.
The tools that LlamaIndex provides perform data loading, data indexing and storage, querying your data with LLMs, and evaluating the performance of your LLM applications (a minimal usage sketch follows the list):
- Data connectors ingest your existing data from their native source and format.
- Data indexes structure your data in intermediate representations, typically vector embeddings, that are easy for LLMs to consume.
- Engines provide natural language access to your data. These include query engines for question-answering, and chat engines for multi-message conversations about your data.
- Agents are LLM-powered knowledge workers augmented by software tools.
- Observability/Tracing/Evaluation integrations enable you to experiment, evaluate, and monitor your app.
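The canonical “starter” use of those tools is only a few lines; this sketch assumes the llama-index package, a local data directory, and an OpenAI API key for the default embedding and chat models:

```python
# Minimal LlamaIndex sketch: load local files, build a vector index, and query it.
# Assumes the llama-index package and an OpenAI API key for the default models.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()   # data connector
index = VectorStoreIndex.from_documents(documents)      # data index (embeddings)
query_engine = index.as_query_engine()                  # query engine
print(query_engine.query("What do these documents say about loan terms?"))
```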
For more on LlamaIndex, see my review.
Semantic Kernel
Semantic Kernel is an open-source SDK that serves the same role in Microsoft’s open-source LLM application stack as AI orchestration does in Microsoft’s internal stack for Copilots: It sits in the middle and ties everything together. Copilot is, of course, Microsoft’s name for collaborative AI agents.
Semantic Kernel is the glue, the orchestration layer, that connects LLMs with data and code. It does a bit more, as well: Semantic Kernel can generate plans using LLMs and templates. That’s a step beyond what you can do with function calling alone, and it’s a differentiator for Semantic Kernel.
Semantic Kernel’s planner function takes a user’s “ask” (Microsoft-speak for “request”) and returns a plan on how to accomplish the request. To do that, it uses AI to “mix and match” plugins that you register in the kernel, combining them into a series of steps that complete the task.
A plugin in Semantic Kernel is a group of functions that can be used by an LLM. Semantic Kernel also includes an AI orchestration layer, connectors, and planners. The orchestration layer ties the plugins and connectors together. The planners help define the flow of plugins.
The Semantic Kernel kernel (note the lack of capitalization for the second instance of “kernel”) is essentially a traffic cop for AI applications. It selects AI services, renders prompts, invokes AI services, parses LLM responses, and creates function results. Along the way it can invoke other services, such as monitoring and responsible AI filters.
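Here is a minimal sketch of that arrangement in the Semantic Kernel Python SDK: a native plugin registered with the kernel and invoked through it. The plugin class, function, and model name are invented for illustration, and the exact API surface has shifted between SK versions, so check the current docs:

```python
# Sketch of a Semantic Kernel (Python) plugin registered with the kernel.
# The plugin and model name are illustrative; an OpenAI API key is assumed.
import asyncio
from datetime import datetime, timezone

from semantic_kernel import Kernel
from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion
from semantic_kernel.functions import kernel_function

class TimePlugin:
    @kernel_function(description="Returns the current UTC time.")
    def utc_now(self) -> str:
        return datetime.now(timezone.utc).isoformat()

async def main():
    kernel = Kernel()
    kernel.add_service(OpenAIChatCompletion(ai_model_id="gpt-4o-mini"))
    kernel.add_plugin(TimePlugin(), plugin_name="time")
    # Invoke the plugin function through the kernel.
    result = await kernel.invoke(plugin_name="time", function_name="utc_now")
    print(result)

asyncio.run(main())
```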
For a closer look at Semantic Kernel, see my review.
Which framework should you use?
Honestly, any of these four frameworks — Haystack, LangChain, LlamaIndex, and Semantic Kernel — will do the job for most LLM application use cases. As they are all open source, you can try them all for free. Their debugging tools differ, their programming language support differs, and the ways they have implemented cloud versions also differ. I’d advise you to try each one for a day or three with a clear but simple use case of your own as a goal, and see which one works best for you.