Building a RAG System with LlamaIndex, Pinecone, and GPT-4

06 July, 2025 ・🕒 8 min read
As AI becomes increasingly accessible, we gain powerful opportunities to build intelligent assistants for specific, real-world problems. In my case, that meant helping university students get accurate, instant answers about academic procedures without relying on complex portals.
This article explains how I built AskUiz, a student assistant app powered by a Retrieval-Augmented Generation (RAG) architecture. It combines FastAPI, LlamaIndex, Pinecone, and GPT-4 to answer pedagogical questions about Ibn Zohr university.
Large language models (LLMs) like GPT-4 have revolutionized how machines understand and generate human language. However, despite their remarkable capabilities, these models are limited by static training data and can sometimes produce inaccurate or outdated responses. They don't inherently "know" your specific documents or data (though the Model Context Protocol, MCP, also addresses this limitation).
This is where Retrieval-Augmented Generation comes in: it enhances LLM performance by integrating external knowledge sources into the generation process. The flow is simple:
- Index your data (PDFs, internal docs, web pages...)
- Retrieve relevant chunks based on user queries
- Feed that context into the language model
- Generate answers grounded in facts
Instead of relying solely on what the model has memorized during training, RAG systems retrieve relevant documents from a dedicated knowledge base and feed this information into the model to produce more accurate answers.
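As a minimal illustration of that flow (a sketch rather than the actual AskUiz code, assuming LlamaIndex's default OpenAI embedding and LLM settings with an API key in the environment, and a hypothetical `./data` folder of documents), the whole loop fits in a few lines:

```python
# Minimal RAG loop with LlamaIndex: index -> retrieve -> feed context -> generate.
# Assumes OPENAI_API_KEY is set and ./data contains a few documents.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./data").load_data()    # 1. index your data
index = VectorStoreIndex.from_documents(documents)         # embeds and stores chunks
query_engine = index.as_query_engine(similarity_top_k=3)   # 2-3. retrieve chunks, feed them as context
response = query_engine.query("How do I request an enrollment certificate?")  # hypothetical query
print(response)                                             # 4. answer grounded in the retrieved chunks
```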
AskUiz is a practical example of this architecture. I built it using the following stack:
Tech Stack
| Component | Tool |
|---|---|
| API Backend | Python + FastAPI |
| Embedding & Retrieval | LlamaIndex |
| Vector Store | Pinecone |
| Language Model | OpenAI GPT-4 |
| Rate Limiting | Redis |
| Frontend | Next.js |
| Deployment | Vercel, Render |
System Architecture
Here's a high-level overview of AskUiz's architecture:

I leveraged Python's ecosystem with libraries like LlamaIndex for document processing, Pinecone for vector storage, and OpenAI's API for embedding and completion.
Data Collection and Preprocessing
The first critical step was gathering high-quality, domain-relevant data. For AskUiz, this included:
- Course and degree information
- Administrative procedures documentation
- Location and contact details for university institutions
- Services and student affairs information
Once collected, data preprocessing transforms raw content into a structured format suitable for embedding and retrieval:
The Preprocessing Pipeline
- Data Cleaning: Removing unnecessary characters, correcting formatting, stripping HTML tags, and standardizing encoding
- Chunking: Breaking documents into smaller, semantically meaningful segments (I used approximately 500 tokens per chunk with 50-token overlaps)
- Metadata Addition: Adding source information, document type, faculty, and other relevant attributes to each chunk
- Indexing: Converting content into fixed-size vectors using embedding models
I used LlamaIndex to manage document parsing, chunking, and routing embeddings to Pinecone for storage:
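Here's a sketch of that ingestion step (the index name, data directory, and metadata values are illustrative assumptions, not the exact AskUiz configuration):

```python
# Ingestion sketch: parse documents, chunk them (~500 tokens, 50-token overlap),
# attach metadata, and route the embeddings into a Pinecone index via LlamaIndex.
from pinecone import Pinecone
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.vector_stores.pinecone import PineconeVectorStore

pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
vector_store = PineconeVectorStore(pinecone_index=pc.Index("askuiz"))  # assumed index name
storage_context = StorageContext.from_defaults(vector_store=vector_store)

documents = SimpleDirectoryReader("./data").load_data()
for doc in documents:
    # Illustrative metadata; in practice these come from the document's source
    doc.metadata.update({"source": "university-website", "faculty": "ENSIASD"})

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    transformations=[SentenceSplitter(chunk_size=500, chunk_overlap=50)],
)
```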
Retrieval from Pinecone
When a user asks a question like:
"How can I get my license diploma from ENSIASD?"
The retrieval process involves:
- Embedding the query using the same model used for documents
- Searching for the most semantically similar chunks in Pinecone
- Returning the top-k relevant pieces to use in generation
This ensures GPT-4 has real context instead of hallucinating answers based on general knowledge.
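Here's roughly what that retrieval step looks like with LlamaIndex (a sketch, where `index` is the vector index built during ingestion and `similarity_top_k` is whatever top-k value you settle on):

```python
# Retrieval sketch: embed the query, search Pinecone, keep the top-k chunks.
retriever = index.as_retriever(similarity_top_k=3)
nodes = retriever.retrieve("How can I get my license diploma from ENSIASD?")

for n in nodes:
    print(f"score={n.score:.3f}  {n.node.get_content()[:80]}...")
```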
Prompting & Generation with GPT-4
With the relevant context retrieved, we build a structured prompt:
You are a university assistant for Ibn Zohr University.
Answer the question based ONLY on the context provided below.
If you can't find the answer in the context, say "I don't have
enough information to answer this question accurately."
Context:
- [Chunk 1 content]
- [Chunk 2 content]
- [Chunk 3 content]
...
User question:
"{user_query}"
Then we send it to OpenAI's GPT-4 via the chat/completions endpoint:
import openai  # legacy (pre-1.0) OpenAI Python SDK interface

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_with_context},
    ],
    temperature=0.3,  # lower temperature for more factual responses
)
The response is then parsed and sent back to the frontend. This approach grounds the model's answers in the specific documents we've provided, reducing hallucinations significantly.
Frontend Experience (Next.js)
On the frontend, AskUiz uses Next.js and TypeScript to:
- Send queries to the backend API
- Display GPT-4's answers with minimal latency
- Collect user feedback (e.g., helpful/unhelpful)
- Track usage with Vercel Analytics
The interface is mobile-first and responsive, designed to work well on campus or in transit. Since many students access the service via mobile devices, optimizing for smaller screens was a priority.
Rate Limiting with Redis
To avoid abuse and manage costs, I implemented IP-based rate limiting:
- Maximum 20 requests per day per IP
- Tracked in Redis with TTL (time-to-live) keys
This was simple to implement in FastAPI using dependency injection and provides sufficient protection for an MVP while keeping the service accessible.
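A minimal sketch of that dependency (the route name, key format, and Redis connection details are assumptions for illustration):

```python
# Rate limiting sketch: one Redis counter per client IP, expiring after 24 hours.
import redis
from fastapi import Depends, FastAPI, HTTPException, Request

app = FastAPI()
r = redis.Redis(host="localhost", port=6379, decode_responses=True)
DAILY_LIMIT = 20  # requests per day per IP

def rate_limit(request: Request) -> None:
    key = f"ratelimit:{request.client.host}"
    count = r.incr(key)              # atomic increment for this IP
    if count == 1:
        r.expire(key, 60 * 60 * 24)  # first request of the day sets a 24h TTL
    if count > DAILY_LIMIT:
        raise HTTPException(status_code=429, detail="Daily request limit reached")

@app.post("/ask", dependencies=[Depends(rate_limit)])
async def ask(question: str):
    ...  # run the RAG pipeline and return the answer
```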
Deployment
The deployment strategy keeps the system manageable and cost-effective:
- Frontend is deployed to Vercel
- Backend API runs on Render
- Pinecone and OpenAI are accessed as managed services
- The whole system is monitored with logs, analytics, and feedback mechanisms
Key Lessons Learned
- Quality over quantity. Well-curated data beats large volumes of irrelevant content.
- Prompt engineering is critical. The same data with a better prompt returns better answers.
- Rate limiting matters. Even MVPs need basic protection against abuse.
- Tool selection is important. LlamaIndex + Pinecone + GPT-4 proved to be a powerful combination for RAG.
- Users prioritize clarity and speed. Simple UI and fast answers matter more than flashy features.
Tips for Building Your Own RAG System
- Start small: Begin by indexing just a few high-quality documents
- Use frameworks: LlamaIndex or LangChain help avoid writing boilerplate for parsing and chunking
- Choose a scalable vector DB: Pinecone is reliable and scales well; FAISS, Meta's open-source library, is a good alternative
- Consider caching: Store popular queries to reduce costs, especially with GPT-4
- Implement feedback: Add user feedback mechanisms from day one
- Test with real users: Nothing beats actual user testing for uncovering edge cases