LangChain, Haystack, LlamaIndex and Griptape: Matching Your Needs in LLM Orchestration

David Davalos, Machine Learning Engineer | November 9, 2023


Large language models (LLMs) have revolutionized the entire informatics industry and beyond, breaking numerous paradigms in programming, web development, and tool creation, while also influencing the emergence of new products. However, these models are constrained by their training on fixed data limited to a specific timeframe, which makes updates costly. The traditional remedies, fine-tuning and reinforcement learning, present their own challenges: the former is also costly and the latter is still a research problem. Frameworks for LLM orchestration offer a promising alternative to overcome these limitations.

This is the continuation of our last entry about LLM orchestration.

Disclaimer: The assessment of these frameworks is based on their current developmental stage, as they are continuously evolving. The authors of this blog entry are not associated with any of the frameworks discussed.

LLM orchestration serves as an effective solution to the memory/knowledge constraints within chatbots based on pre-trained LLMs. Beyond merely resolving this issue, it allows for the integration of various data sources, such as PDFs, plain text, webpages, emails, and even Wikipedia articles, directly into the memory.

The fundamental concept behind LLM orchestration is to leverage the core abilities of the language model, primarily its language comprehension, to generate natural language "commands." The ultimate goal in creating a chatbot or query agent is to utilize proprietary data or collect new information, especially in scenarios involving scrapers.

This blog entry aims to compare and evaluate the performance of two key LLM orchestration platforms, LlamaIndex and Griptape, specifically in the context of constructing a chatbot using data extracted from multiple PDFs. The content of these PDFs is known, enabling a comprehensive assessment of response quality.

While tools like LlamaIndex, Griptape, and LangChain offer similar functionalities in constructing basic structures such as simple chatbots with additional knowledge, their underlying philosophies shape their distinct capabilities. LlamaIndex focuses on data ingestion and storage, while Griptape prioritizes agents, tools, and pipelines. The choice between these approaches significantly influences the ease of customization, scalability and complexity in developing tailored chatbots.

Which tools a framework prioritizes, and how easily they can be adjusted, can significantly impact the scalability of the desired solution. Consequently, the implementation of LLM orchestration is heavily influenced by the problem at hand. Given that these tools are all developed in the same programming language (Python), a potential strategy involves combining these platforms to leverage their collective strengths.

What are the components of LLM frameworks?

Before delving into the comparison, let's examine what we consider as the fundamental components of LLM orchestration tools. While LLMs serve as the primary resource, their constrained memory, known as the context window, necessitates mechanisms to introduce new knowledge into these frameworks. One common approach is through the utilization of vector databases, which store and semantically query text segments.
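To make the vector database idea concrete, here is a minimal sketch in plain Python. It is not the API of any framework discussed here: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and `ToyVectorStore` and the sample segments are hypothetical names and data chosen for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Stores text segments with their vectors and returns the closest ones."""

    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def add(self, segment: str) -> None:
        self._items.append((segment, embed(segment)))

    def query(self, question: str, top_k: int = 1) -> list[str]:
        q = embed(question)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [segment for segment, _ in ranked[:top_k]]


store = ToyVectorStore()
store.add("The warranty covers battery defects for two years.")
store.add("Returns are accepted within thirty days of purchase.")
store.add("Shipping to Europe takes five business days.")

question = "How long is the battery warranty?"
context = store.query(question, top_k=1)

# Only the retrieved segment is placed in the prompt, which keeps the
# request within the model's context window.
prompt = f"Answer using only this context:\n{context[0]}\n\nQuestion: {question}"
```

A real deployment would replace `embed` with a learned embedding model and the list scan with an approximate nearest-neighbor index, but the flow (embed, store, rank by similarity, paste into the prompt) stays the same.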

Another crucial aspect involves the use of "prompts," which serve as engineered queries directly to the LLM. Users fill certain sections of these prompts with information, enabling tailored responses. Additionally, these tools incorporate agents and auxiliary resources. Although the definitions may vary between frameworks, their core purpose is to augment the chatbot with supplementary resources beyond the LLM's capabilities. For instance, when an LLM struggles to provide an answer, such as performing complex numerical calculations, a Python module can be summoned. In such scenarios, the LLM decides which tool to employ, and this entire decision-making and computational process is often encapsulated within an "agent."
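The interplay of prompts, tools and agents described above can be sketched in a few lines of plain Python. Everything here is an illustrative assumption rather than any framework's API: `PROMPT_TEMPLATE`, `calculator`, `fake_llm_choose_tool` and `run_agent` are hypothetical names, and the tool choice is made by a simple rule where a real agent would ask the LLM.

```python
import ast
import operator
import re

# A prompt is an engineered query with slots that are filled at run time.
PROMPT_TEMPLATE = (
    "You are a helpful assistant.\n"
    "Context: {context}\n"
    "Question: {question}\n"
)

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression: str) -> float:
    """Tool: safely evaluate a basic arithmetic expression (no eval())."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError("unsupported expression")
    return _eval(ast.parse(expression, mode="eval").body)


def fake_llm_choose_tool(question: str) -> dict:
    """Stand-in for the LLM's decision; a real agent would ask the model."""
    for match in re.finditer(r"\d[\d\s\.\+\-\*/\(\)]*", question):
        candidate = match.group().strip()
        if re.search(r"[\+\-\*/]", candidate):
            return {"tool": "calculator", "input": candidate}
    return {"tool": "llm", "input": question}


def run_agent(question: str, context: str = "") -> str:
    """Agent loop: decide on a tool, run it, and return the result."""
    decision = fake_llm_choose_tool(question)
    if decision["tool"] == "calculator":
        return str(calculator(decision["input"]))
    # Otherwise fill the template; a real agent would send this prompt to the LLM.
    return PROMPT_TEMPLATE.format(context=context, question=question)


result = run_agent("What is 1234 * 5678?")
```

The point of the agent abstraction is that the caller only sees `run_agent`; which tool was used, and how its input was extracted, stays an internal decision.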

LlamaIndex

Formerly known as GPT Index, LlamaIndex focuses heavily on the ease of ingesting various data types, facilitated by the utilization of LlamaHub. LlamaHub is a community-maintained set of data connectors, ranging from database queries to direct connections with platforms like Slack, Telegram, Yelp, and more. Within LlamaIndex, all documents are abstracted under a class called "index," which encompasses various indexing operations, allowing seamless manipulation among these documents. This indexed data becomes readily accessible for querying purposes. While LlamaIndex possesses the capacity to integrate tools, it is not the platform's primary forte.

Probably one of the most notable features of LlamaIndex is hypothetical embeddings: the LLM first generates a hypothetical (possibly hallucinated) answer to the query, and the embedding of that answer is then used to semantically search for the true answer.
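The hypothetical embeddings idea can be illustrated with a small sketch. This is not LlamaIndex code: `fake_llm_hypothetical_answer` is a hypothetical stand-in for an LLM call, `embed` is a toy bag-of-words embedding, and the segments are made-up sample data.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fake_llm_hypothetical_answer(query: str) -> str:
    """Stand-in for an LLM call that drafts a plausible, possibly wrong, answer."""
    return "The device battery is covered by a warranty lasting one year."


SEGMENTS = [
    "The warranty covers battery defects for two years.",
    "Returns are accepted within thirty days of purchase.",
]


def hyde_search(query: str, segments: list[str]) -> str:
    """Search with the embedding of a hypothetical answer instead of the query."""
    hypothetical = fake_llm_hypothetical_answer(query)
    h_vec = embed(hypothetical)
    return max(segments, key=lambda s: cosine(h_vec, embed(s)))


query = "How long am I protected if the cell dies?"
best = hyde_search(query, SEGMENTS)
```

In this toy example the query shares almost no vocabulary with the stored segments, while the drafted answer does. The draft states the wrong duration, but it still lands next to the segment that contains the correct one, which is the effect the technique relies on. The extra LLM call per query is also a plausible reason for the added latency.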

Griptape

Griptape offers a comprehensive framework for developing complete LLM applications, much like LangChain. It excels in creating highly tailored chatbots, providing a straightforward and intuitive experience for developers. Within Griptape, the process of constructing agents—essentially equipped chatbots—is simplified once the necessary tools have been defined. Notably, documents can be readily transformed into tools specifically crafted for particular queries, enhancing the platform's versatility.

Moreover, Griptape functions as a high-level utility where the clarity in developing customized tools doesn't compromise abstraction. This balance ensures that design and scalability remain robust aspects of the platform.

Constructing a custom chatbot (LlamaIndex vs Griptape)

One obstacle common to both Griptape and LlamaIndex is that their documentation primarily consists of examples rather than detailed information about the modules. Understanding how agents are constructed within LlamaIndex is particularly challenging. Griptape, on the other hand, faces issues where some examples don't function straight out of the box. Getting past these challenges is crucial for creating a customized chatbot that uses agents to represent individual PDF files.

Our experience with Griptape was notably smooth. Defining agents and launching a fully customized chatbot required only a few lines of code. Customization also demanded less fine-tuning compared to LlamaIndex. As previously mentioned, Griptape prioritizes tooling, which is evident in its user-friendly interface. Conversely, LlamaIndex, despite its focus on indexing and data organization, lacks sufficient out-of-the-box guidance in its documentation. Consequently, constructing agents based on the index proved to be a more intricate process and notably underperformed in both accuracy and execution time compared to Griptape.

Both LlamaIndex and Griptape offer direct mechanisms for ingesting PDFs. Various chatbot schemes and fine-tunings have been tested. It was observed that both frameworks exhibit similar performance when a single database encompasses all PDFs. However, when constructing agents for each PDF, LlamaIndex tends to struggle unless the query is extremely precise. In contrast, Griptape adeptly manages the usage of agents, demonstrating faster and more accurate responses. Notably, Griptape displayed greater resilience against unconventional queries and never experienced crashes.
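The difference between a single database and one agent (or tool) per PDF can be sketched as a two-step lookup: first choose a document, then search only inside it. The code below is a plain-Python illustration, not the Griptape or LlamaIndex API; `DocumentTool`, `route`, the tool names and the sample texts are all hypothetical, and word overlap stands in for the LLM's tool choice.

```python
import re
from dataclasses import dataclass


def tokens(text: str) -> set[str]:
    """Lowercased word set used for a crude overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class DocumentTool:
    """One PDF wrapped as a tool: a description plus its text segments."""
    name: str
    description: str
    segments: list[str]

    def search(self, question: str) -> str:
        q = tokens(question)
        return max(self.segments, key=lambda s: len(q & tokens(s)))


def route(question: str, tools: list[DocumentTool]) -> DocumentTool:
    """Stand-in for the agent's tool choice: pick the best-matching description."""
    q = tokens(question)
    return max(tools, key=lambda t: len(q & tokens(t.description)))


tools = [
    DocumentTool("warranty_pdf", "warranty terms and battery coverage", [
        "The warranty covers battery defects for two years.",
        "Water damage is not covered.",
    ]),
    DocumentTool("shipping_pdf", "shipping times and delivery costs", [
        "Shipping to Europe takes five business days.",
        "Delivery is free for orders above fifty euros.",
    ]),
]

question = "How long does shipping to Europe take?"
chosen = route(question, tools)   # step 1: pick the document
answer = chosen.search(question)  # step 2: search only within it
```

In this sketch a vague question that matches no description falls back to the first tool, so the quality of the routing step bounds the quality of the answer. That is one plausible reading of why per-PDF agents are more sensitive to query precision than a single combined store.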

Final remarks

Our assessment leads us to consider Griptape as a more comprehensive and relatively well-documented framework. LlamaIndex, on the other hand, primarily excels as a robust tool for indexing, storage, and queries. It can be effectively combined with LangChain and Griptape to string together queries in a more intricate workflow.

Detailed Comparisons: LlamaIndex vs. Griptape
| Aspect | LlamaIndex | Griptape |
| --- | --- | --- |
| Website | llamaindex.ai | griptape.ai |
| Cost & model | Open source; uses gpt-3.5-turbo by default, but any LLM can be used (including local LLMs) | Open source; uses gpt-4 by default, also for constructing context vectors |
| Funding | $8.5 MM (Jan-2023, source: businesswire) | $14.6 MM (Nov-2023, source: tracxn) |
| Out-of-the-box integrations and tools | Straightforward local ingestion of many document formats from a given folder, and overall easy ingestion of online data using LlamaHub | Useful default tools (math calculations, weather, etc.) to create agents with high-level code. Modules for creating custom tools |
| Community support | Great support, especially with regard to LlamaHub | Very scarce, but the product is relatively new |
| Complication | The documentation is mostly theoretical and the code is explained with examples. Although there are examples for many scenarios, the framework is notably low level | Good, concise documentation, although mostly example-based too; saving context vectors locally does not seem to be implemented, but saving them in vector DBs like Pinecone is |
| Simplified workflow description | High-level framework for ingesting, querying and indexing data, but low level and involved when building agents. Swappable storage components with optional fine-grained control | High-level framework for ingesting and tooling. Has several wrapping levels, starting with agents and pipelines, and also workflows for non-sequential task graphs (directed acyclic graphs, DAGs) |
| Data connectors and tools | This is the strongest feature of LlamaIndex. It has LlamaHub, a community-driven set of connector plugins, including Wikipedia, Yelp, Slack, Gmail, Google Calendar and Docs, Spotify, etc. | Loaders for PDFs, SQL DBs, CSV, text and web |
| Conversation "memory" retention | Has a structure called ConversationBufferMemory | By default it has a structure called ConversationMemory, but also has SummaryConversationMemory and DynamoDbConversationMemoryDriver. It can be stored locally |
| Output parsers | Output quality enhanced with hypothetical embeddings | Images, CSV, JSON and PDF files and many other formats |
| Debugging | It crashes when several agents generate a response; since documentation is scarce, this is difficult to solve | Documentation is scarce and the vector store does not work out of the box. No further bugs were detected |
| Other features | Hypothetical embeddings, although using them is slow | Fast; the log shows that it tries several levels of processing to get a reliable answer to queries |
| Other comments | Relatively slow and often crashes when using agents, particularly query sub-engines | Separate documents as tools are easy to set up and give better answers |


Detailed Comparisons: LangChain vs. Haystack
| Aspect | LangChain | Haystack |
| --- | --- | --- |
| Website | https://www.langchain.com | https://haystack.deepset.ai |
| Cost and model | Open source; business model appears to be value-added services on top of open source to large enterprises | Open source; presumably supports parent company deepset’s other products like deepset Cloud |
| Funding | $10 MM | $45.2 MM (deepset, parent company) |
| Out-of-the-box integrations and tools | Many, e.g., AWS Lambda, Apify, Hugging Face, YouTube | Fewer, but allows user to create custom tools / pipelines |
| Community support | Very good | OK |
| Complication | Rather complicated. Abstracts almost every concept into a class, e.g. even a simple Python f-string is abstracted into a “Prompt Template”. Your opinion of this may depend on your affinity for very object-oriented code bases | Simpler to open, read, understand and use out of the box. No shortage of classes, but they seem slightly more intuitive in design and in how you are supposed to use them together |
| Simplified workflow description | Prompts and other Model I/O components feed into Chains. Agents may route the conversation to different Tools or ToolKits | Each Node performs a task; multiple Nodes make up a Pipeline. Agents may route the conversation to different Tools |
| Data connectors and tools | Robust set of tools like loaders, transformers, embedding models, vector DB interfaces, retrievers | Set of many tools like converters, classifiers, retrievers, with seemingly slightly fewer bells and whistles |
| Conversation "memory" retention | Several different options including conversation history or summary, knowledge graphs, token length, different flushing options, storage in a DB and targeted retrieval | Fewer options; handled less explicitly. Integrates with Redis to store memory across conversations |
| Output parsers | More flexible in structuring the model response; parser can be a List, Datetime, Enum, Pydantic, Auto-fixing, Retry or Structured Output Parser | Limited parser options; parser can be either the default BaseOutputParser or AnswerParser, which parses the model output to extract the answer into a proper Answer object using regex patterns |
| Debugging | Proprietary debugging framework, LangSmith, currently in beta. Stack trace during normal IDE debugging is inflated by everything being a class | No special tools outside of normal IDE debugging |
| Other features | Callbacks to hook into various stages of the application. Asynchronous support. Examples of autonomous agents. Moderation of response fallacies | OCR support. Pre-built REST API to interface with it. Examples of integration with Rasa. Annotation tool |
| Other comments | More tools, but some may crash; entire app may crash when query is meaningless. Can’t retrieve data after 2021 | Seems to handle meaningless queries better |