Talking Tokens

Ever wondered how AI, from creating images to writing text, actually “thinks”? It’s not a monolithic brain, but rather an intricate network built from countless tiny components. To truly grasp AI’s capabilities, we need to understand its fundamental building blocks: tokens.

Think of the human brain. Its astounding ability to process information and generate thoughts comes from billions of individual brain cells, or neurons, each connecting and firing signals in complex patterns.

In much the same way, if an entire AI model is like the vast human brain, then AI tokens are its individual brain cells.

WHY

The answer lies in the fundamental nature of how computers process information: they only understand numbers.

HOW

AI doesn’t just “read” text; it breaks it down into tokens using a tokenizer. Think of a tokenizer as a linguistic surgeon, dissecting raw text into meaningful units.

The exact method varies, but the general process is:

Break Down Text: The tokenizer first splits continuous text into smaller units. This isn’t just splitting by spaces. There are several splitting strategies, word-level, character-level, and subword, each with its own trade-offs, and weighing them makes it clear why the most popular one is subword tokenization.

Subword tokenization (like BPE or WordPiece) is common for large language models. It smartly balances word and character-level approaches. It identifies common characters and sequences, then merges frequent pairs into new, longer subwords. This allows the AI to understand both common words and rare ones by breaking them into familiar subword units (e.g., “unbelievable” becomes [“un”, “believe”, “able”]).
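To make the “merge frequent pairs” idea concrete, here is a toy Python sketch of the BPE training loop. The corpus, word frequencies, and number of merges are made up for illustration; real tokenizers train on billions of words and perform tens of thousands of merges.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus and return the top one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word starts as a tuple of characters, with a frequency.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
for _ in range(3):  # perform 3 merges
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair)
```

After three merges, “low” has become a single subword unit, and “lower”/“lowest” are represented as a shared stem plus suffix pieces, which is exactly how rare words get built from familiar parts.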

Assign Numerical IDs: Each unique token is then given a unique numerical ID. This ID is the AI’s actual language (e.g., “the” becomes 123).
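In code, that lookup is just a dictionary from token to ID. The IDs below are made up for illustration; a real model’s vocabulary has tens of thousands of entries with its own numbering.

```python
# Toy vocabulary mapping tokens to IDs -- the numbers are illustrative,
# not the IDs any real tokenizer assigns.
vocab = {"the": 123, "un": 517, "believe": 2042, "able": 481}

def encode(tokens):
    # The model never sees the strings, only this list of integers.
    return [vocab[t] for t in tokens]

print(encode(["un", "believe", "able"]))  # [517, 2042, 481]
```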

A great way to visualize token generation is to use TikTok, just kidding, it’s to use tiktokenizer.

If you simply type the word “unbelievable”, you can see how one word generates three or more tokens.

Tokenization is a crucial step, but it’s part of a much longer journey. Significant processes, starting with data gathering and continuing well beyond tokenization, are necessary before we arrive at a sophisticated LLM like ChatGPT. Much of my understanding of this intricate pipeline comes from Andrej Karpathy’s insightful, freely available content on YouTube. More to come.

Now, go fix some bugs!

Splitting Up Without Breaking Up: Partitioning Your Database with Style

WHY

In the high-stakes world of database management, sometimes the healthiest relationship is one with boundaries. If your database is starting to feel overwhelmed, sluggish, or just plain unmanageable, it might be time for the “we need to talk” conversation. But don’t worry—this isn’t a breakup; it’s a strategic restructuring that will make your relationship with your data stronger.

WHAT

Think of partitioning as sharding’s more localized cousin — a way to break up data within a single database or server to improve performance, maintainability, and query efficiency. While sharding is about splitting data across multiple nodes, partitioning is about organizing data smarter within the same node. Database partitioning is the practice of dividing a database table into smaller, more manageable segments based on defined rules while maintaining the logical appearance of a single table to applications interacting with it.

ANALOGY

Database partitioning is like organizing a clothing store where instead of piling all merchandise (data) into one massive, chaotic section, you thoughtfully arrange men’s, women’s, and children’s clothes into separate departments (partitions). Shoppers can easily find what they need, store employees can efficiently manage inventory and restocking for their specific section, and the store can expand by adding specialized sections without disrupting the existing layout – all while maintaining a seamless shopping experience.

HOW

Just as there are multiple ways to organize a closet, databases offer several partitioning strategies, each with its own strengths:

  • Horizontal Partitioning (Row-Based): Splits table rows across partitions based on ranges of a column value, like dividing customer records by date ranges or ID ranges.
  • Vertical Partitioning (Column-Based): Separates columns of a table into different partitions, typically grouping frequently accessed columns together and rarely used columns in separate partitions.
  • Functional Partitioning: Organizes data based on how it’s used in your application, grouping related tables or functionality together regardless of structural similarities.
  • List Partitioning: Divides data based on specific, discrete values in a column, such as storing customer data in different partitions based on country or region.
  • Hash Partitioning: Distributes rows evenly across partitions using a hash function on the partition key, ideal when natural groupings don’t exist or balanced distribution is critical.
  • Composite Partitioning: Combines multiple partitioning strategies, such as first partitioning by date range, then sub-partitioning each range by region or customer type.
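To see why hash partitioning spreads rows evenly, here is a small Python sketch of the routing idea. The partition names and count are hypothetical, and the key modulo the partition count stands in for the real hash function Postgres applies internally.

```python
NUM_PARTITIONS = 4  # hypothetical partition count

def partition_for(customer_id: int) -> str:
    # The key modulo the partition count (a stand-in for a real hash
    # function) spreads keys evenly and always routes the same key
    # to the same partition.
    return f"orders_p{customer_id % NUM_PARTITIONS}"

print(partition_for(101))  # orders_p1
print(partition_for(104))  # orders_p0
```

The payoff is balance: consecutive IDs land in different partitions, so no single partition becomes a hotspot even when keys arrive in order.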

Let’s look at an example of range partitioning, the most common strategy.

Let’s say you have a table called orders, and you want to partition it by order_date, one partition per year.

1. Create the Partitioned Table

CREATE TABLE orders (
    id SERIAL,
    customer_id INT NOT NULL,
    order_date DATE NOT NULL,
    amount NUMERIC,
    PRIMARY KEY (id, order_date) -- Postgres requires the partition key in the primary key
) PARTITION BY RANGE (order_date);

2. Create Yearly Partitions

CREATE TABLE orders_2023 PARTITION OF orders
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');

CREATE TABLE orders_2024 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

CREATE TABLE orders_2025 PARTITION OF orders
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');

3. Insert Data (Postgres Routes Automatically)

INSERT INTO orders (customer_id, order_date, amount)
VALUES 
    (101, '2023-03-15', 200.00),
    (102, '2024-07-01', 350.00),
    (103, '2025-01-20', 500.00);

4. Query Normally

SELECT * FROM orders WHERE order_date >= '2024-01-01';

PostgreSQL will automatically prune irrelevant partitions during the query for performance gains.

Now, go fix some bugs!