RAG Chatbot 1: Project structure
A term you will inevitably come across when deploying a chatbot for internal use within a company is RAG, short for Retrieval-Augmented Generation. What is it about?
RAG combines a generator model with a retriever model to produce high-quality text. The generator produces an initial draft, while the retriever retrieves relevant existing texts and incorporates them into the output. This hybrid approach can improve coherence, variety, and factual accuracy in generated text. (Meta-Llama-3-8B-Instruct.Q4_0.gguf “What is RAG in context of chatbots?”)
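Given a word or a sentence (the query), RAG fetches relevant information from a database. Relevance is determined by a metric, for text data usually a similarity metric such as cosine similarity. The retrieved passages are then handed to the language model as context for generating the answer.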
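To make the retrieval step more tangible, here is a minimal sketch of ranking documents by cosine similarity. It uses only the standard library, and the three-dimensional "embeddings", the document names, and the helper functions `cosine_similarity` and `top_k` are made up for illustration; a real pipeline would obtain vectors from an embedding model and typically store them in a vector database.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Similarity is undefined for a zero vector; treat it as "not similar".
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors (assumption): invented numbers standing in for real embeddings.
docs = {
    "vacation_policy": [0.9, 0.1, 0.0],
    "expense_report": [0.1, 0.9, 0.2],
    "onboarding": [0.7, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]

results = top_k(query, docs, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

With these toy vectors, `vacation_policy` ranks first and `onboarding` second, because their vectors point in nearly the same direction as the query.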
In this blog series we look at setting up a RAG pipeline. Let’s start with some considerations on the project structure.
A good project structure in a Python project is important because it makes the code more organized, maintainable and scalable. It:
- Keeps related files together
- Facilitates easy modification or addition of new features
- Improves collaboration among team members
- Simplifies navigation through the codebase
- Allows for consistent naming conventions
It’s achieved by dividing the project into logical directories like:
- src: for main application logic
- tests: for running unit tests and integration tests
- utils or helpers: for reusable functions
- data: for storing data files, e.g., configuration files or example datasets
So what could a good project structure for a RAG chatbot look like? We could embark on a flat structure, where modules live next to the project configuration, data, etc. For clarity, we decide to provide modules under a src source code directory, and keep other directories and files a project user would usually interact with at the project root:
rag_company_docs/
│
├── config/ # Configuration files and constants
│ └── settings.yaml
│
├── data/ # Input and processed files
│ ├── raw/
│ └── processed/
│
├── dist/ # Distribution
│
├── src/ # Source code
│
├── pyproject.toml # Poetry project configuration
│
└── README.md
Here, we use Poetry to manage the project and, most importantly, its dependencies and the virtual environment.
The source code directory src will look something like the layout shown below, where we split modules according to typical RAG aspects:
src/
│
├── loaders/ # Handlers for different document types
│ ├── base_loader.py
│ ├── docx_loader.py
│ ├── pdf_loader.py
│ └── pptx_loader.py
│
├── ocr/ # OCR-specific preprocessing
│ ├── ocr_utils.py
│ └── ocr_engine.py
│
├── indexing/ # Vector store creation and management
│ ├── index_builder.py
│ └── index_manager.py
│
├── retriever/ # Querying logic
│ ├── retriever.py
│ └── reranker.py
│
├── rag/ # RAG orchestration logic
│ ├── rag_pipeline.py
│ └── prompts.py
│
├── utils/ # Shared utilities
│ ├── file_utils.py
│ └── logger.py
│
└── main.py # CLI or script entry point
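To illustrate why the loaders directory starts with a base_loader.py, here is a rough sketch of how a shared loader interface and a dispatch by file extension could look. All names (`BaseLoader`, `PlainTextLoader`, `pick_loader`, the `extensions` attribute) are hypothetical and not taken from the project. A plain-text loader stands in for the PDF, DOCX and PPTX loaders, which would rely on third-party parsing libraries.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class BaseLoader(ABC):
    """Hypothetical common interface, as it might live in loaders/base_loader.py."""

    # File suffixes (lowercase, with leading dot) this loader can handle.
    extensions: tuple = ()

    @abstractmethod
    def load(self, path: Path) -> str:
        """Return the text content of the document at `path`."""


class PlainTextLoader(BaseLoader):
    """Stand-in loader; pdf_loader.py etc. would follow the same pattern."""

    extensions = (".txt",)

    def load(self, path: Path) -> str:
        return Path(path).read_text(encoding="utf-8")


def pick_loader(path, loaders):
    """Return the first loader whose extensions match the file suffix."""
    suffix = Path(path).suffix.lower()
    for loader in loaders:
        if suffix in loader.extensions:
            return loader
    raise ValueError(f"No loader registered for '{suffix}' files")


# Selecting a loader does not touch the file system; the path is illustrative.
LOADERS = [PlainTextLoader()]
loader = pick_loader("data/raw/handbook.txt", LOADERS)
print(type(loader).__name__)
```

The benefit of such an interface is that the indexing code only ever calls `load()` and does not need to know which document type it is dealing with; supporting a new format then means adding one module under loaders/ and registering it.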