
Understanding LLMs

Before we dive into code, let's understand what Large Language Models (LLMs) are and how we'll use them.

What is an LLM?

An LLM is an AI model trained on vast amounts of text data. It learns patterns in language and can:

  • Generate text - Write responses, articles, code
  • Answer questions - Based on its training data
  • Follow instructions - Complete tasks you describe
  • Reason - Work through problems step by step

Popular LLMs include:

  • OpenAI's GPT-4, GPT-4o, GPT-4o-mini
  • Anthropic's Claude
  • Meta's Llama
  • Google's Gemini

How LLMs Work (Simplified)

LLMs predict the next token (a word or subword piece) based on all previous tokens. Given the text "Idli is a steamed rice", for example, a model would likely assign high probability to " cake" as the next token.

The LLM doesn't "understand" in the human sense - it predicts what text should come next based on patterns learned during training.

[Figure: token prediction visualization]

The OpenAI API

We'll use OpenAI's API to access GPT-4o-mini. Here's what an API call looks like:

from openai import OpenAI

client = OpenAI(api_key="sk-...")

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's idli?"}
    ]
)

print(response.choices[0].message.content)
# "Idli is a traditional South Indian breakfast dish..."

Message Roles

Role      | Purpose
----------|-------------------------------------------
system    | Sets the AI's personality and instructions
user      | The human's message
assistant | The AI's previous responses (for context)

Conversation History

LLMs are stateless - the model doesn't remember anything between API calls. To maintain context, you send the entire conversation history with each request:

messages = [
    {"role": "system", "content": "You are an Indian food expert."},
    {"role": "user", "content": "What's idli?"},
    {"role": "assistant", "content": "Idli is a steamed rice cake..."},
    {"role": "user", "content": "How do I make it?"},  # New question
]
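After each turn, you append the assistant's reply before adding the next user message, so the model always sees the full conversation. A minimal sketch of one turn, assuming the client from above (the follow-up question is just an illustrative placeholder):

# Send the full history, then append the reply so the next turn has context
response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
reply = response.choices[0].message.content

messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "Can I make it without a steamer?"})  # hypothetical next question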

Streaming Responses

Instead of waiting for the entire response, we can stream it token by token:

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    stream=True  # Enable streaming
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

This creates the "typing" effect you see in ChatGPT.
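If you also need the complete text afterwards - for example, to append it to the conversation history - you can accumulate the chunks as they arrive. A minimal variant of the loop above:

full_reply = ""

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="")
        full_reply += delta  # accumulate the streamed text

messages.append({"role": "assistant", "content": full_reply})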

Limitations of LLMs

LLMs have important limitations:

  1. Knowledge cutoff - They only know what was in their training data
  2. Hallucinations - They can confidently state incorrect information
  3. No real-time data - They can't access the internet or databases
  4. Context limits - They can only process a limited amount of text at once (see the token-counting sketch after this list)
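You can get a feel for context limits by counting tokens yourself. A minimal sketch using OpenAI's tiktoken library (using the o200k_base encoding for the gpt-4o family is an assumption; check tiktoken's model mapping):

import tiktoken

# gpt-4o-family models use the o200k_base encoding (assumption)
enc = tiktoken.get_encoding("o200k_base")

text = "Idli is a steamed rice cake, a traditional South Indian breakfast dish."
tokens = enc.encode(text)

print(len(tokens))         # how many tokens this text consumes
print(enc.decode(tokens))  # round-trips back to the original text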

How We Solve These

This is where RAG (Retrieval-Augmented Generation) comes in. Instead of relying on the LLM's training data, we:

  1. Store our curated food database
  2. Retrieve relevant dishes based on the user's query
  3. Include those dishes in the prompt
  4. Let the LLM generate a response using our data (a minimal sketch follows this list)
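Here's a minimal sketch of those four steps, using a tiny in-memory list and naive keyword matching as stand-ins for a real database and retriever (the dishes and the matching logic are purely illustrative):

# 1. Store: a tiny stand-in for the curated food database
dishes = [
    {"name": "Idli", "description": "Steamed rice cake, a light South Indian breakfast dish."},
    {"name": "Dosa", "description": "Thin fermented rice-and-lentil crepe."},
]

query = "What's a light South Indian breakfast?"

# 2. Retrieve: naive keyword overlap (a real system would use embeddings)
keywords = [w.lower().strip("?,.") for w in query.split() if len(w) > 3]
relevant = [d for d in dishes if any(k in d["description"].lower() for k in keywords)]

# 3. Include the retrieved dishes in the prompt
context = "\n".join(f"{d['name']}: {d['description']}" for d in relevant)

# 4. Let the LLM generate a response using our data
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": f"Answer using only these dishes:\n{context}"},
        {"role": "user", "content": query},
    ],
)
print(response.choices[0].message.content)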

This gives us:

  • Accuracy - Recommendations from our verified database
  • Control - We decide what information the LLM can access
  • Freshness - We can update our database anytime

Next, let's understand RAG in detail.