Embeddings

Embeddings are the bridge between human language and vector databases. They convert text into numbers that capture meaning.

What is an Embedding?

An embedding is a numerical representation of text that captures its semantic meaning.

Similar texts produce similar vectors:
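
For illustration, with made-up numbers (real embedding vectors have hundreds or thousands of dimensions):

# Toy 3-dimensional vectors - the numbers here are invented for illustration
vec_breakfast = [0.21, -0.14, 0.63]   # "light breakfast"
vec_meal      = [0.19, -0.11, 0.60]   # "small morning meal" - points the same way
vec_music     = [-0.52, 0.33, -0.08]  # "heavy metal music" - points elsewhere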

How Embeddings Capture Meaning

Embedding models are trained on billions of text samples. They learn that:

  • "breakfast" and "morning meal" often appear in similar contexts
  • "spicy" and "hot" have overlapping meanings
  • "dosa" and "crepe" share characteristics

This training produces vectors where semantic similarity = vector similarity.

Using OpenAI Embeddings

We use OpenAI's text-embedding-3-small model:

from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Idli is a soft, fluffy steamed rice cake from South India"
)

embedding = response.data[0].embedding
print(f"Dimensions: {len(embedding)}") # 1536
print(f"First 5 values: {embedding[:5]}") # [0.023, -0.156, ...]

Embedding Our Food Data

When we ingest our food database:

# Each dish gets embedded
for dish in dishes:
    # Create a rich text description
    text = f"""
{dish['name']}: {dish['description']}
Cuisine: {dish['cuisine']}
Ingredients: {', '.join(dish['ingredients'])}
Good for: {dish['meal_type']}
"""

    # Generate embedding
    embedding = get_embedding(text)

    # ChromaDB metadata values must be scalars (str/int/float/bool),
    # so flatten the ingredients list before storing
    metadata = {**dish, 'ingredients': ', '.join(dish['ingredients'])}

    # Store in ChromaDB
    collection.add(
        ids=[dish['id']],
        embeddings=[embedding],
        documents=[text],
        metadatas=[metadata]
    )
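
A side note: the embeddings endpoint also accepts a list of inputs, so you can embed all dishes in a single request instead of one call per dish. A minimal batched sketch, reusing the OpenAI client from above:

# Batch version: one API call embeds every description at once
texts = [f"{dish['name']}: {dish['description']}" for dish in dishes]
response = client.embeddings.create(model="text-embedding-3-small", input=texts)
embeddings = [item.embedding for item in response.data]  # same order as `texts`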

Similarity Measurement

We measure similarity using cosine similarity:

cos(θ) = (A · B) / (|A| × |B|)

Where:
  • A · B = dot product of the vectors
  • |A|, |B| = magnitudes of the vectors
  • Result: -1 (opposite) to 1 (identical)
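
For intuition, the same computation takes a few lines of NumPy (a quick sketch for experimenting, not something the app needs):

import numpy as np

def cosine_similarity(a, b):
    # (A · B) / (|A| × |B|)
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The toy vectors from earlier: near-identical direction scores close to 1
print(cosine_similarity([0.21, -0.14, 0.63], [0.19, -0.11, 0.60]))  # ≈ 0.999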

You don't need to calculate this yourself - ChromaDB does it automatically.

# ChromaDB handles similarity
results = collection.query(
    query_texts=["light breakfast"],  # Gets embedded automatically
    n_results=5
)
# Returns dishes ranked by similarity
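
One caveat: query_texts only works as shown if the collection embeds queries with the same model used for the documents. A sketch of one way to set that up, using ChromaDB's bundled OpenAI embedding function and cosine distance (the collection name here is hypothetical):

import chromadb
from chromadb.utils import embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="sk-...",  # your OpenAI API key
    model_name="text-embedding-3-small",
)

chroma_client = chromadb.Client()
collection = chroma_client.create_collection(
    name="dishes",                      # hypothetical name
    embedding_function=openai_ef,
    metadata={"hnsw:space": "cosine"},  # cosine distance (ChromaDB defaults to L2)
)

The query result is a dict with "ids", "documents", "distances", and "metadatas", one list per query; with the cosine space, distance = 1 - cosine similarity.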

Embedding Strategies

What to Embed

For each dish, we create a rich text representation:

def create_dish_embedding_text(dish):
    # Collect optional dietary tags so empty placeholders aren't embedded
    tags = []
    if dish['is_high_protein']:
        tags.append('high protein')
    if dish['is_low_carb']:
        tags.append('low carb')

    return f"""
{dish['name']} - {dish['description']}

Cuisine: {dish['cuisine']}
Region: {dish['region']}
Type: {dish['meal_type']}

Ingredients: {', '.join(dish['ingredients'])}

Characteristics: {dish['spice_level']} spice, {dish['prep_time']} minutes to prepare{', ' + ', '.join(tags) if tags else ''}
"""

The richer the description, the more hooks a query has to match on: "quick low-spice breakfast" can match prep time, spice level, and meal type, not just the dish name.

When to Re-embed

You need to re-embed when:

  • You update dish descriptions
  • You add new dishes
  • You change the embedding model

You don't need to re-embed when:

  • You change metadata (allergens, spice level)
  • Users update preferences
  • You modify the LLM prompts
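
That works because ChromaDB stores metadata alongside the vectors and can update it in place, leaving the stored embeddings untouched. A sketch, assuming the collection from earlier (the dish id is hypothetical):

# Update metadata only - the embedding for this dish is not recomputed
collection.update(
    ids=["dish_42"],
    metadatas=[{"spice_level": "mild"}],
)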

Cost Considerations

OpenAI embedding costs (as of 2024):

Model                     Cost per 1M tokens
text-embedding-3-small    $0.02
text-embedding-3-large    $0.13

For our 57-dish database:

  • Average 100 tokens per dish description
  • Total: ~5,700 tokens to embed all dishes
  • Cost: 5,700 tokens × $0.02 / 1M ≈ $0.0001 (basically free)

Each user query also needs embedding (~20 tokens), but even with thousands of queries, embedding costs are negligible.
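
To estimate costs for your own data, you can count tokens locally with OpenAI's tiktoken library (a sketch; cl100k_base is the encoding used by the text-embedding-3 models):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
texts = [create_dish_embedding_text(dish) for dish in dishes]
total_tokens = sum(len(enc.encode(t)) for t in texts)
cost = total_tokens * 0.02 / 1_000_000  # text-embedding-3-small: $0.02 per 1M tokens
print(f"{total_tokens} tokens ≈ ${cost:.6f}")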

Key Takeaways

  1. Embeddings convert text to vectors that capture meaning
  2. Similar meanings = similar vectors - this is what enables semantic search
  3. OpenAI provides easy-to-use embedding APIs
  4. Embed rich descriptions for better matching

Next, let's understand MCP - the protocol for giving AI access to tools.