"I want Llama3 to perform 10x with my private knowledge" - Local Agentic RAG w/ llama3

Artificial intelligence (AI) has the potential to revolutionize many aspects of our lives, and one area where it is already making a significant impact is in knowledge management. Every organization has a vast amount of weekly documentation and meeting notes that are often disorganized and difficult for human beings to digest. However, with the help of large language models, this problem is finally being solved.

Large language models can read and understand different types of data, and retrieve answers to our questions, making it possible to quickly and easily access the information we need. In fact, the end of last year saw a big discussion about whether search engines like Google would be disrupted by large language models. With the ability to provide personalized answers based on their extensive knowledge, why would we still need to use traditional search engines?

We are already seeing this shift, with more and more people turning to platforms like ChatGPT or Perplexity to answer their daily questions. There are also platforms like Glean that focus on knowledge management for corporate data. Building an AI chatbot that can chat with your PDFs, PowerPoints, or spreadsheets is now easier than ever, but building one that can reliably answer even basic questions is a different story.

There is a significant gap between what people think AI is capable of today and what it is actually capable of. Over the past few months, I have been trying to build different types of AI applications for various business use cases to determine what works and what doesn't. Today, I want to share some of my learnings with you on how to build a rock-solid application that is both reliable and accurate.

There are two common ways to give a large language model your private knowledge: fine-tuning or training your own model, or putting knowledge into the prompt (also known as in-context learning or retrieval-augmented generation). Fine-tuning bakes your knowledge into the model weights, which can provide precise knowledge with fast inference. However, how to fine-tune a model effectively is not common knowledge, and you also need to prepare the training data properly.

In-context learning, on the other hand, puts your knowledge into the prompt, and it is far more common and widely used. Instead of getting the large language model to answer users' questions directly, we retrieve relevant knowledge and documents from our private database and insert that knowledge into the prompt, so that the large language model has additional context.
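The retrieve-then-insert flow described above can be sketched in a few lines. This is a minimal illustration, not a production design: the documents are invented, the names `retrieve` and `build_prompt` are hypothetical, and word overlap stands in for the embedding-based vector search a real system would more likely use. The resulting prompt would then be sent to whichever model you run (for example a local Llama 3).

```python
import re

# Toy in-memory "private knowledge base"; the contents are invented for illustration.
DOCUMENTS = [
    "Refund policy: customers can request a refund within 30 days of purchase.",
    "Onboarding: new employees receive a laptop on their first day.",
    "Meeting notes: the Q3 launch was moved to October.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by word overlap with the question.

    Assumption: word overlap is only a stand-in for vector similarity search.
    """
    q_tokens = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_tokens & tokenize(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, documents: list[str], top_k: int = 2) -> str:
    """Insert the retrieved documents into the prompt as additional context."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, top_k))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "How many days do customers have to request a refund?", DOCUMENTS, top_k=1
)
print(prompt)
```

Only the refund policy document ends up in the prompt here, which is the point of the approach: the model sees a small, relevant slice of the private data rather than all of it.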

Building a production-ready retrieval-augmented generation (RAG) application for business is actually really complex, despite how simple it is to get started and build a proof of concept. A simple RAG implementation runs into many challenges. For example, real-world data is really messy, and much of it is not just simple text paragraphs. It can be a combination of images, diagrams, charts, and tables. If you just use a normal data parser or data loader for a PDF file, you will often extract incomplete or messy data that the large language model cannot easily process.

Another challenge is retrieval itself. Even after you create a database from the company's knowledge, accurately retrieving the information relevant to a question is hard, because different types of data and documentation often call for different retrieval methods. For example, if your data lives in spreadsheets or SQL databases, vector search might not be the best answer, while keyword search or SQL queries will yield better and more accurate results.
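As a rough sketch of why the retrieval method should follow the data type, the example below answers a numeric question with an exact SQL aggregate and a free-text question with keyword matching. Everything here is assumed for illustration: the `sales` table, the notes, and the helper names `sql_total_revenue`, `keyword_search`, and `route` are hypothetical, and in a real system the SQL would more likely be generated per question than hard-coded.

```python
import re
import sqlite3

# Hypothetical structured data, as it might come from a spreadsheet export.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, revenue INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EMEA", "Q1", 120), ("EMEA", "Q2", 150), ("APAC", "Q1", 90), ("APAC", "Q2", 110)],
)

# Hypothetical unstructured notes.
NOTES = [
    "The pricing page was redesigned in March.",
    "Support tickets dropped after the new onboarding flow.",
]


def sql_total_revenue(region: str) -> int:
    """Exact aggregate over structured rows; similarity search cannot compute this."""
    row = conn.execute(
        "SELECT SUM(revenue) FROM sales WHERE region = ?", (region,)
    ).fetchone()
    return row[0]


def keyword_search(query: str, notes: list[str]) -> list[str]:
    """Return the notes that share at least one word with the query."""
    terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    return [n for n in notes if terms & set(re.findall(r"[a-z0-9]+", n.lower()))]


def route(source_type: str) -> str:
    """Pick a retrieval method from the type of the data source (illustrative rule)."""
    return {"table": "sql", "text": "keyword"}.get(source_type, "vector")


print(route("table"), sql_total_revenue("EMEA"))
print(route("text"), keyword_search("pricing", NOTES))
```

The routing rule is deliberately trivial. The design choice it illustrates is that the decision about how to retrieve is made before retrieval, based on what kind of source holds the answer.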

To build a reliable and accurate RAG application, you need to consider many different tactics to mitigate these risks. One tactic is better data preparation, which is probably one of the most important and also one of the easiest ways to improve quality immediately. The challenge is that real-world data is really messy, especially when dealing with formats like PDF or PowerPoint.

To prepare data effectively, you can use parsers built for large language models, such as LlamaParse, which is designed specifically for converting PDF files into a markdown format that large language models handle well. It extracts table data more accurately than other types of parsers, and it is a really smart parser: you can pass it a prompt that tells it what the document type is and how you want it to extract information.

Another parser you can use is Firecrawl, introduced by Mendable, which provides a scraper that turns website data into clean markdown that large language models can easily process. By reducing the amount of noise that the large language model receives, you can improve the accuracy of the answers it generates.
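To make the noise-reduction idea concrete, here is a small sketch that uses only Python's standard library. It is not Firecrawl's API or implementation; the class `MarkdownExtractor`, the helper `html_to_markdown`, and the sample page are all hypothetical. It drops scripts, styles, navigation, and footers, and keeps headings, paragraphs, and list items as markdown lines.

```python
from html.parser import HTMLParser


class MarkdownExtractor(HTMLParser):
    """Keep headings, paragraphs and list items; drop page chrome and scripts."""

    SKIP = {"script", "style", "nav", "footer"}
    PREFIXES = {"h1": "# ", "h2": "## ", "h3": "### ", "p": "", "li": "- "}

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.prefix = None    # markdown prefix of the block being read, if any
        self.buffer = []      # text fragments of the current block
        self.lines = []       # finished markdown lines

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in self.PREFIXES:
            self.prefix = self.PREFIXES[tag]

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.PREFIXES and self.prefix is not None:
            # Collapse runs of whitespace so wrapped source lines become one line.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.lines.append(self.prefix + text)
            self.buffer, self.prefix = [], None

    def handle_data(self, data):
        if self.skip_depth == 0 and self.prefix is not None:
            self.buffer.append(data)


def html_to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


# Invented sample page with typical noise around a small amount of content.
HTML = """
<html><head><style>body { color: red; }</style></head>
<body>
<nav><ul><li>Home</li><li>Login</li></ul></nav>
<h1>Pricing</h1>
<p>The <b>Pro</b> plan costs
   $20 per month.</p>
<ul><li>Unlimited projects</li><li>Priority support</li></ul>
<script>trackPageView();</script>
<footer><p>Copyright Example Inc.</p></footer>
</body></html>
"""

print(html_to_markdown(HTML))
```

Real pages need far more care (tables, nested lists, links, dynamically rendered content), which is why a dedicated scraper is attractive. The sketch only shows the principle: the less boilerplate reaches the prompt, the less the model has to ignore.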

In conclusion, building a reliable and accurate RAG application is a complex task, but by using better data preparation, optimizing the chunk size, and using advanced retrieval tactics, you can mitigate many of the risks and build a production-ready application that can provide value to your organization.