Talking Tokens

Ever wondered how AI, from creating images to writing text, actually “thinks”? It’s not a monolithic brain, but rather an intricate network built from countless tiny components. To truly grasp AI’s capabilities, we need to understand its fundamental building blocks: tokens.

Think of the human brain. Its astounding ability to process information and generate thoughts comes from billions of individual brain cells, or neurons, each connecting and firing signals in complex patterns.

In much the same way, if an entire AI model is like the vast human brain, then AI tokens are its individual brain cells.

WHY

The answer lies in the fundamental nature of how computers process information: they only understand numbers.

HOW

AI doesn’t just “read” text; it breaks it down into tokens using a tokenizer. Think of a tokenizer as a linguistic surgeon, dissecting raw text into meaningful units.

The exact method varies, but the general process is:

Break Down Text: The tokenizer first splits continuous text into smaller units. This isn’t just splitting by spaces. There are different splitting strategies (word-level, character-level, and subword), each with its own pros and cons, and weighing them makes it easy to guess why the most popular one is subword tokenization.

Subword tokenization (like BPE or WordPiece) is common for large language models. It smartly balances word and character-level approaches. It identifies common characters and sequences, then merges frequent pairs into new, longer subwords. This allows the AI to understand both common words and rare ones by breaking them into familiar subword units (e.g., “unbelievable” becomes [“un”, “believe”, “able”]).

Assign Numerical IDs: Each unique token is then given a unique numerical ID. This ID is the AI’s actual language (e.g., “the” becomes 123).

A great way to visualize token generation is to use TikTok… just kidding, it’s to use tiktokenizer.

If you simply type the word “unbelievable”, you can see how one word generates three or more tokens.
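
If you prefer to poke at this from code, OpenAI’s tiktoken library exposes the same idea. Here is a minimal sketch (assuming tiktoken is installed; the exact split depends on which tokenizer vocabulary you load):

import tiktoken

# Load one of the tokenizer vocabularies used by OpenAI models
enc = tiktoken.get_encoding("cl100k_base")

# Step 1: break the text into subword tokens; Step 2: map each to a numeric ID
token_ids = enc.encode("unbelievable")
print(token_ids)  # a list of integers, one per subword token

# Decode each ID back to its subword piece to see how the word was split
for tid in token_ids:
    print(tid, repr(enc.decode([tid])))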

Tokenization is a crucial step, but it’s part of a much longer journey. Significant processes, starting with data gathering and continuing well beyond tokenization, are necessary before we arrive at a sophisticated LLM like ChatGPT. My understanding of this intricate pipeline owes much to the insightful, freely available content from Andrej Karpathy on YouTube. More to come.

Now, go fix some bugs!

Re-ranking in Vector Search

Analogy to define the problem

Imagine a child has a big box of Lego bricks in various colors. They’re asked to build a tower, but with a special rule: the bricks most similar to red should be placed at the top, and the bricks least similar to red at the bottom. Now, let’s add one more condition: the tower has to be finished in the shortest possible time.

Under that time pressure, there is a good chance they will make mistakes and overlook a few bricks because of the emphasis on speed.

Analogies are fun, but don’t try to make them too perfect – they all break down eventually!

How does this problem apply to vector search

Approximation: Many vector search systems use approximate nearest neighbor (ANN) techniques for speed, which means they don’t always find the absolute closest vectors, leading to slightly inaccurate rankings.

Vector Representation: The quality of the vector embeddings matters. If the vectors don’t accurately capture the semantic meaning of the data, similarity scores will be misleading.

Distance Metric: The chosen distance metric (e.g., cosine similarity, Euclidean distance) may not perfectly align with the user’s perception of relevance; the small sketch after this list shows how two metrics can disagree.

Data Distribution: Vector spaces can be high dimensional and sparse, causing the “curse of dimensionality” where distance becomes less meaningful.
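
To make the distance-metric point concrete, here is a tiny sketch with made-up 2-D vectors where cosine similarity and Euclidean distance disagree about which document is “closest” to the query:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Made-up 2-D vectors purely for illustration
query = np.array([[1.0, 1.0]])
doc_a = np.array([[3.0, 3.0]])  # same direction as the query, but far away
doc_b = np.array([[1.0, 0.5]])  # close to the query, but pointing elsewhere

# Cosine similarity prefers doc_a (identical direction gives similarity 1.0)
print(cosine_similarity(query, doc_a)[0][0], cosine_similarity(query, doc_b)[0][0])

# Euclidean distance prefers doc_b (it is simply nearer in space)
print(euclidean_distances(query, doc_a)[0][0], euclidean_distances(query, doc_b)[0][0])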

What exactly happens in re-ranking

Re-ranking Model

  • The initial search results, along with the query vector, are processed by an advanced model to refine relevance scoring.
  • Typically, this model is transformer-based (e.g., BERT) and trained to capture semantic similarity.
  • It evaluates each document in relation to the query and may also consider contextual relationships among retrieved documents.
  • The model assigns a relevance score to each document, reflecting its alignment with the query.

Re-ordering

  • The retrieved results are re-ranked based on the relevance scores from the re-ranking model.
  • Documents with the highest scores are prioritized at the top of the results list for improved accuracy.

Example Code

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sentence_transformers import SentenceTransformer

# 1. Define sample data
documents = [
    "Machine learning is a subfield of artificial intelligence",
    "Neural networks are used in deep learning applications",
    "Natural language processing deals with text data",
    "Computer vision focuses on image recognition",
    "Reinforcement learning uses rewards to train agents"
]

# 2. Create vector embeddings for documents
encoder = SentenceTransformer('all-MiniLM-L6-v2')
document_embeddings = encoder.encode(documents)

# 3. Build a nearest-neighbor index
# (sklearn's NearestNeighbors is an exact search; it stands in here for a true
#  ANN index such as FAISS or Annoy, which trade a little accuracy for speed)
ann_index = NearestNeighbors(n_neighbors=3, algorithm='ball_tree')
ann_index.fit(document_embeddings)

# 4. Define user query and convert to embedding
query = "How do AI systems understand language?"
query_embedding = encoder.encode([query])[0].reshape(1, -1)

# 5. Initial Retrieval: Get approximate nearest neighbors
distances, indices = ann_index.kneighbors(query_embedding)
initial_results = [(i, documents[i], distances[0][idx]) for idx, i in enumerate(indices[0])]

print("Initial ANN Results:")
for idx, doc, dist in initial_results:
    print(f"Doc {idx}: {doc} (Distance: {dist:.4f})")

# 6. Re-ranking with a more sophisticated model (simulated here)
def rerank_model(query, candidates):
    # In a real system, this would be a transformer model like BERT
    # Here we simulate with a simple relevance calculation
    relevance_scores = []
    for _, doc, _ in candidates:
        # Count relevant keywords as a simple simulation
        keywords = ["language", "text", "natural", "processing"]
        score = sum(keyword in doc.lower() for keyword in keywords)
        # Add a baseline score
        score += 0.5
        relevance_scores.append(score)
    return relevance_scores

# 7. Apply re-ranking
relevance_scores = rerank_model(query, initial_results)

# 8. Re-order results based on new relevance scores
reranked_results = [(initial_results[i][0], initial_results[i][1], score) 
                    for i, score in enumerate(relevance_scores)]
reranked_results.sort(key=lambda x: x[2], reverse=True)

print("\nRe-ranked Results:")
for idx, doc, score in reranked_results:
    print(f"Doc {idx}: {doc} (Relevance Score: {score:.4f})")

When Re-Ranking Offers Limited Value

Poor initial results: If the first search is bad, re-ranking can’t fix it.

Uniform relevance: If everything is equally relevant/irrelevant, re-ranking is pointless.

Computational cost: Re-ranking is slow, impacting real-time use.

Relevance mismatch: The model’s “relevance” differs from the user’s.

Ambiguous queries: Re-ranking struggles with unclear requests.

Data bias: Biased training leads to biased results.

This serves as a reference to clarify my understanding of the problem, explore a potential solution, and outline its limitations.

Now, go fix some bugs! 🚀

Vector Search, the R of RAG


What

A vector is a quantity that has both magnitude (or length) and direction. In machine learning, a vector might represent the features of an image or a piece of text, and the numbers within the vector are the values of those features.

Why

Imagine you’re searching for a car with a red color. In a traditional keyword search, you’d need the car’s description to explicitly mention “red color” to get a match. But car companies rarely keep it that simple — instead, they use creative names like “Cherry Bomb Tintcoat” or “Rosso Corsa”. A keyword search would struggle to connect your “red color” query with these fancy names. However, with vector search, the system understands the underlying meaning and context, so even abstract color names show up — ranked by a relevance score that reflects how closely your query aligns with the results.

How

Vector search, powered by embeddings, allows systems to understand the meaning of words and phrases. Embeddings convert text into numerical vectors that represent their semantic relationships.

Example

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from sentence_transformers import SentenceTransformer

# Initialize a pre-trained sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample car color names (like a car database)
car_colors = [
    "Cherry Bomb Tintcoat",
    "Rosso Corsa",
    "Soul Red Crystal Metallic",
    "Firecracker Red",
    "Carmine Red",
    "Race Red",
    "Ruby Flare Pearl"
]

# Convert car color names to vectors (embeddings)
color_embeddings = model.encode(car_colors)

# User query
user_query = "I want a car with red color"

# Convert user query to vector
query_embedding = model.encode([user_query])

# Calculate cosine similarity between query and car color names
similarities = cosine_similarity(query_embedding, color_embeddings)[0]

# Rank results by similarity score
ranked_results = sorted(zip(car_colors, similarities), key=lambda x: x[1], reverse=True)

# Display top results
print("Top matches for your query:\n")
for car_color, score in ranked_results[:3]:
    print(f"{car_color} (Relevance Score: {score:.2f})")

This approach offers a more effective way to retrieve relevant contextual information or facts based on a user’s query or question. In the context of Retrieval-Augmented Generation (RAG), this information enhances the model’s responses by grounding them in accurate, relevant data. I’ll dive deeper into how this works in my next post.
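
As a teaser for that post, here is a minimal sketch of the “augmented” part: the top matches from the example above are stuffed into a prompt so a language model can ground its answer in them. The prompt wording is just an illustration, and the actual LLM call is left out.

# Reuse ranked_results and user_query from the example above
top_context = "\n".join(color for color, _ in ranked_results[:3])

prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{top_context}\n\n"
    f"Question: {user_query}\n"
)

print(prompt)  # this prompt would then be sent to an LLM of your choice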

Now, go fix some bugs! 🚀