Intro
It’s been a while since I wrote the last article. I am glad I can resume and continue describing Production Machine Learning implementations. I really hope it helps you, reader, solve your real-life challenges.
Today, nobody writes about neural networks and complicated 100-line scripts. It's all about LLMs and short but powerful solutions involving one or multiple LLMs. This is why there are so many AI experts now ^_^.
Most clients I work with today are fascinated with Gen AI and what it can do in the next few years. And I agree with them. It is truly fascinating! I’ve seen LLMs powering customer service agents, robots preparing food, and even robotic lines manufacturing drones.
We can do so much with the power of super-big neural networks (like LLMs).
In this series of articles, we will review some possible implementations of LLMs, such as chatbots, reasoning engines, decision-makers, and others.
We will start from the most common use case, the chatbot, to build a solid understanding of the technology and frameworks.
We will define the chatbot as an algorithm that provides users with information through human-like communication. The info the chatbot provides is generated based on user questions or requests.
Confusion
By now, you have probably read quite a few LinkedIn posts and blog articles on how to build a chatbot in 3-5-10-20 lines of code. Some of them you have probably tried, and you may even have gotten some results.
And with all this research you’ve done, it can seem quite confusing to understand:
- which databases to use in which cases (vector, SQL, etc.),
- which frameworks to use and when,
- which LLM to use when,
to solve your specific problem. Then, how do you deploy it in production so it works for your customers, and not just in your Colab notebook?
In this series, we will structure and explain it step by step with examples.
Structure
First, let’s limit the variability and provide structure to the topic by enumerating the possible combinations of database type and engine (algorithm) you can use.
We will not discuss LLMs at the moment, and there are a few reasons for that:
- usually, in production implementations, people use GPT-4,
- the decision on which model to use depends on budgets, available infrastructure, non-functional and functional requirements, and (most importantly) the use case.
In future articles, we will discuss choosing (or creating) the right LLM for your task. In this series, we will focus on two dimensions:
- how we store the necessary information,
- how the chatbot retrieves and uses that information to reply to the user.
As for the storage, there are three possible ways to store information:
- Vector database, where data is stored as vectors and their metadata.
- Database supporting SQL queries. It can store data in the form of documents, tables, or others. Most importantly, it supports SQL queries as an information retrieval algorithm.
- Pandas Dataframe. In this case, we load data from any possible source into memory and use it in our algorithm.
Are there any other possible data storage methods we can use? Yes, for example, graph databases. But the three above are most commonly used and cover 95% of your use cases.
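To make the vector option concrete, here is a minimal sketch of similarity-based retrieval. Everything here is made up for illustration: the documents, the tiny 3-dimensional "embeddings", and the scores. A real system would use an embedding model and a proper vector database, but the core idea — rank stored vectors by similarity to the query vector — is the same.

```python
import math

# Toy in-memory "vector store": (embedding, text) pairs.
# The 3-dim vectors below are invented; a real embedding model
# would produce hundreds or thousands of dimensions.
DOCS = [
    ([0.9, 0.1, 0.0], "Return policy: items can be returned within 30 days."),
    ([0.1, 0.8, 0.1], "Shipping: orders arrive in 3-5 business days."),
    ([0.0, 0.2, 0.9], "Warranty: electronics carry a 1-year warranty."),
]

def cosine(a, b):
    # Cosine similarity: dot product divided by the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_embedding, k=1):
    # Rank all stored documents by similarity to the query and keep top-k.
    ranked = sorted(DOCS, key=lambda d: cosine(query_embedding, d[0]), reverse=True)
    return [text for _, text in ranked[:k]]

# A query embedding close to the "shipping" document.
print(retrieve([0.2, 0.9, 0.0]))
```

In production, the linear scan over `DOCS` is replaced by an approximate nearest-neighbor index, but the interface stays the same: vector in, ranked texts out.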
We will review four types of algorithms that can use an LLM to generate answers to user queries or questions:
- Basic information retriever. This algorithm requests data from the database, provides this information to LLM, and requests LLM to generate a human-conversation-like answer based on the question and data provided.
- Basic Agent. This is a bit more advanced algorithm that uses LLM to plan reasoning sequences. The most common sequence is Question-Thought-Action-Observation, where the algorithm uses LLM to retrieve information from the database and then tries to understand if the retrieved data is good enough. If the data is ok, it will try to use LLM to generate the answer; if not, it will try to retrieve new information. We will investigate this logic and process in more detail further.
- Agent with Tools. This is an upgraded version of the previous Agent. In this case, apart from the database retriever, it can also use other tools. It could be pretty much any program that can receive a request from the Agent and provide the answer.
- Finally, the coolest kid in the yard – Multi-Agent. Here, we combine multiple LLMs to perform different tasks. Some of them get their own Agent and can reason and perform reflections. At this point, our Retrieval Augmented Generation pipeline can turn into a Graph with additional nodes and connections, meaning it can have multiple turning points or logic switches, making it non-linear (and interesting).
These are the main types of algorithms that you can use to power your chatbot in production. Are there any more? I would say 98% of the cases fall into one of the types above. But, of course, there is always ‘your option’ and something new and interesting.
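The first (and simplest) of these, the basic information retriever, can be sketched in a few lines: fetch data, build a prompt, ask the LLM to phrase the answer. Both `call_llm` and `retrieve_from_db` below are placeholders I invented for illustration; in a real bot they would wrap your LLM client and your chosen storage.

```python
# Sketch of the "basic information retriever" pattern:
# retrieve -> build prompt -> generate answer. No real LLM is called.

def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would call your LLM API here.
    return f"(LLM answer based on a prompt of {len(prompt)} chars)"

def retrieve_from_db(question: str) -> str:
    # Placeholder retrieval step: vector search, SQL query, or a
    # dataframe lookup, depending on your storage choice.
    return "Hiking boots 'TrailPro' cost $120 and have a 4.8 rating."

def answer(question: str) -> str:
    context = retrieve_from_db(question)
    prompt = (
        "Answer the user's question using only the context below.\n"
        f"Context: {context}\n"
        f"Question: {question}"
    )
    return call_llm(prompt)

print(answer("How much are the TrailPro boots?"))
```

Note the single fixed sequence: there is no loop, no reflection, and no second attempt — exactly what distinguishes this from the Agent variants that follow.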
Let us summarize all combinations of storage and algorithm types in a table:
| | Retriever | Base Agent | Agent with Tools | Multi-agent RAG Graph |
|---|---|---|---|---|
| Vector | Vector information retriever | Agent with vector information retriever | Agent with vector information retriever and tools | Multi-agent RAG Graph with vector information retriever |
| SQL | SQL database data retriever | Agent with SQL database data retriever | Agent with SQL database data retriever and tools | Multi-agent RAG Graph with SQL database data retriever |
| Dataframe | Pandas dataframe data retriever | Agent with Pandas dataframe data retriever | Agent with Pandas dataframe data retriever and tools | Multi-agent RAG Graph with Pandas dataframe data retriever |
I will show you how to implement each type in this series of articles. Some are more common (like the Vector Retriever), and others require custom implementation (like the SQL Retriever or the Agent with Tools).
But it is important to remember that you should choose technology for your Business Case and not vice versa.
Therefore, it is time to discuss which data storage and algorithm type is better in which cases.
Use Cases
First of all, you should consider the type of information you need the bot to retrieve and use.
If your bot works with textual data such as articles and documents, the Vector database would be the optimal data storage solution.
If you are building a recommendation system that recommends items from a catalog, it is best to use tabular storage, from which the bot can retrieve information using SQL queries. This type of storage works well for product recommenders, searches for certified installers, and similar use cases.
When your information has a complex structure and is stored in CSV or JSON files, it is often easier to load it into a Pandas Dataframe in memory and work with it directly. In such a case, you use a dataframe-type information storage and retrieval mechanism.
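A quick sketch of that in-memory approach, using an invented product catalog in nested JSON. I use only the standard library here to keep the example dependency-free; in practice you would often reach for `pandas.read_json` or `pandas.json_normalize` and get a Dataframe with the same content.

```python
import json

# Made-up catalog with nested structure -- awkward to flatten into SQL,
# trivial to query once loaded into memory.
RAW = """
[
  {"title": "TrailPro boots",   "specs": {"rating": 4.8, "price": 120}},
  {"title": "CityRun sneakers", "specs": {"rating": 4.5, "price": 80}}
]
"""

records = json.loads(RAW)

def top_rated(records, min_rating=4.6):
    # Filter on a nested field directly in memory.
    return [r["title"] for r in records if r["specs"]["rating"] >= min_rating]

print(top_rated(records))
```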
When you work with factual data and need a bot to reason across the dataset or its portion, I would not recommend using the Vector type of storage and retrieving mechanism. By reasoning across the dataset, I mean operations with the data such as average, last, summarisation for a period, etc.
For example, you want your bot to answer questions like: “Give me top-rated products from the category of hiking boots, which have good discounts at the moment.” – The vector database retrieving mechanism will not be able to help you. In this case, you must run an SQL query or Dataframe filtering to get the needed information.
To define the second dimension, the type of algorithm, you need to understand what you want the bot to do with the information it was able to retrieve.
If you want the chatbot to immediately reply to the user with the information it was able to retrieve – go for a simple Basic information retriever.
Sometimes, users might not know how to phrase their questions, and therefore, the data that the Retriever will get from the database using the user’s request might not be what the user wants.
In such cases, you might want your bot to be able to rephrase/change the request in the database and get better data.
For example, a user asks, “I need medium-price boots to walk in nature with a good rating”, and your algorithm produces a query like:
SELECT title, price FROM products WHERE category = 'boots for walking in the nature' ORDER BY rating DESC LIMIT 3
In this case, it will not get any information from the database simply because you don’t have such a category. Instead, you have ‘shoes for active sport’ or ‘hiking shoes’ categories.
A simple retriever will not get any results and will return to the user with the message, “Sorry, but we don’t have anything for you.”
This can (and should) be improved by turning the Retriever into an Agent that can reason. In this case, the Agent can form a second query to check which categories you have:
SELECT DISTINCT category FROM products WHERE category LIKE '%shoes%'
Pick the one that should satisfy the user request:
SELECT title, price FROM products WHERE category = 'shoes for active sport' ORDER BY rating DESC LIMIT 3
The user will be happy to know you have the shoes she wants.
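The retry flow above can be sketched with an in-memory SQLite table. To keep the sketch runnable, the "reasoning" is hard-coded: a real Agent would have the LLM generate the first query, observe the empty result, and pick the replacement category itself. The table contents are made up.

```python
import sqlite3

# Toy product table (invented data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (title TEXT, price REAL, category TEXT, rating REAL)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?)",
    [
        ("TrailPro boots", 120.0, "hiking shoes", 4.8),
        ("CityRun sneakers", 80.0, "shoes for active sport", 4.5),
    ],
)

def search(category: str):
    # Action: run the query the "LLM" proposed.
    rows = conn.execute(
        "SELECT title, price FROM products WHERE category = ? "
        "ORDER BY rating DESC LIMIT 3",
        (category,),
    ).fetchall()
    if rows:
        return rows
    # Observation: nothing found. Check which categories actually exist
    # and retry with one of them (an LLM would pick the best match).
    candidates = conn.execute(
        "SELECT DISTINCT category FROM products WHERE category LIKE '%shoes%'"
    ).fetchall()
    best = candidates[0][0]  # stand-in for the LLM's choice
    return conn.execute(
        "SELECT title, price FROM products WHERE category = ? "
        "ORDER BY rating DESC LIMIT 3",
        (best,),
    ).fetchall()

# The made-up category fails, the Agent recovers with a real one.
print(search("boots for walking in the nature"))
```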
We could give many similar examples with vector data retrieval, but I think you get the idea.
When you want the chatbot to be able to work with user input and try to retrieve the best data, a simple Retriever is not enough, and you need to go for Agent.
This strategic shift opens a whole set of opportunities for you. Namely, two are of significant importance:
- Bot Agents can use tools.
- You can stack multiple LLMs and create Multi-LLM Agents and Multi-Agent RAG Graphs.
This may sound more complicated than exciting, but I assure you both possibilities are very cool.
The first option (using tools) is pretty straightforward – you can teach your Agent to use one or a few tools to get better results. It can retrieve information from remote sources, call APIs, perform calculations, perform fraud checks, and many more.
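The tool-use mechanism can be sketched as a registry of callables the Agent may invoke. The tools and the planner below are invented stand-ins: `fake_llm_plan` keys off the wording, whereas a real agent framework would have the LLM choose the tool and its arguments.

```python
# Sketch of tool use: the Agent holds a registry of tools, and a
# (stubbed) planner decides which one to call.

def get_price(product: str) -> str:
    return f"{product} costs $120"  # stand-in for a catalog lookup

def check_stock(product: str) -> str:
    return f"{product} is in stock"  # stand-in for an inventory API call

TOOLS = {"get_price": get_price, "check_stock": check_stock}

def fake_llm_plan(question: str):
    # A real Agent would ask the LLM to pick a tool and its argument;
    # here we key off the wording to stay runnable.
    name = "get_price" if "price" in question or "cost" in question else "check_stock"
    return name, "TrailPro boots"

def run_agent(question: str) -> str:
    tool_name, arg = fake_llm_plan(question)
    return TOOLS[tool_name](arg)

print(run_agent("What is the price of the TrailPro boots?"))
```

The important structural point: adding a capability means adding an entry to `TOOLS`, not rewriting the reasoning loop.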
The second option is much more interesting. You can use multiple LLMs combined in one Graph. In this case, it is possible to design a few agents, each of which uses its own LLM fine-tuned for a specific set of tasks. This way, the bot can perform very complicated reasoning flows and produce very interesting outputs.
This is needed when you design and implement complex reasoning, such as a robot that prepares food orders for users.
In this case, one Agent can use an LLM fine-tuned to retrieve the recipe based on a user request and produce textual cooking instructions. Based on the cooking instructions, other agents in the chain will create a set of high- and low-level code for the cooking devices.
Another example can be complex operations with a Dataframe. Imagine that you are designing and implementing a bot that helps marketing managers analyze data from their Customer Data Platform. A simple vector similarity search would not help here, and even SQL queries might not provide all the needed capabilities.
In this case, you can implement a Multi-LLM setup with a two-stage reasoning process:
- The first Agent works with the first LLM to generate Python code to retrieve/calculate necessary data from the Pandas dataframe.
- The second Agent receives data from the first stage, performs data analysis, and provides a reply to the user.
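The two stages above can be sketched as a simple hand-off between two functions. Both "agents" are stubs: in a real pipeline, the first LLM would generate Pandas code to run against the dataframe, and the second LLM would phrase the result. The order data is made up, and a plain list of dicts stands in for the dataframe.

```python
# Sketch of the two-stage Multi-LLM flow: compute, then explain.

# Invented CDP-style data; in practice this would be a Pandas dataframe.
ORDERS = [
    {"customer": "a", "amount": 40.0},
    {"customer": "b", "amount": 60.0},
    {"customer": "a", "amount": 20.0},
]

def agent_1_compute(request: str) -> float:
    # Stage 1: run the "generated code" -- hard-coded here as an average.
    return sum(o["amount"] for o in ORDERS) / len(ORDERS)

def agent_2_reply(value: float) -> str:
    # Stage 2: turn the numeric result into a user-facing reply
    # (an LLM would do this in a real bot).
    return f"The average order value is ${value:.2f}."

def pipeline(request: str) -> str:
    return agent_2_reply(agent_1_compute(request))

print(pipeline("What is our average order value?"))
```

Splitting the stages also gives you a natural place to handle errors: if stage 1 raises, you can ask the first LLM to regenerate its code before stage 2 ever runs.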
Can you solve the task with just one Agent? Yes. But if you want to handle errors and complex queries, using multiple agents is a good idea.
Next Steps
I hope what you’ve read so far made sense and that by now, you have a fair understanding of your options: which combination of data storage, retrieval mechanism, and reasoning algorithm to use in which situation.
If not, no worries. The following publications will provide more details about each of the implementations, and it will become apparent to you.
So, as I mentioned before, we will now dive into the details of each implementation.
Go ahead and read the following article of the series –