From Words to Lego Bricks
At its heart, tokenization is the process of breaking a stream of text into smaller, manageable pieces called tokens. Think of it like deconstructing a complex Lego model into its individual bricks. A language model does not read raw text directly; it needs to see the language as a sequence of discrete units. These tokens can be words, parts of words, or even individual characters and punctuation marks. This step is the essential bridge between human language and the numbers that a neural network actually operates on.
The Obvious Method (And Why It Fails)
The most intuitive way to tokenize is to split text on spaces and punctuation, making each word a token. The sentence “The cat sat” would become three tokens: “The”, “cat”, and “sat”. This is simple, but it scales badly. For one, the vocabulary would be enormous, because every word form (“run”, “runs”, “running”) would need its own token, and a larger vocabulary makes the model slower and more memory-intensive. More importantly, this approach has no way to handle words it has never seen before, such as new slang, technical jargon, or even simple typos. Such a word is typically replaced with a generic “unknown” token, so whatever it meant is lost to the model. That is a major blind spot.
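To make the blind spot concrete, here is a minimal Python sketch of word-level tokenization. The two-sentence corpus, the `<unk>` placeholder name, and the helper functions are all assumptions made up for illustration, not part of any particular library.

```python
import re

def word_tokenize(text):
    # Runs of word characters become tokens; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocab(corpus):
    # ID 0 is reserved for a hypothetical "<unk>" (unknown word) token.
    vocab = {"<unk>": 0}
    for sentence in corpus:
        for token in word_tokenize(sentence):
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab):
    # Any word missing from the vocabulary collapses to the "<unk>" ID.
    return [vocab.get(token, vocab["<unk>"]) for token in word_tokenize(text)]

corpus = ["The cat sat", "The dog runs"]  # toy training text
vocab = build_vocab(corpus)

print(word_tokenize("The cat sat"))      # ['The', 'cat', 'sat']
print(encode("The cat running", vocab))  # [1, 2, 0]
```

Even though “runs” is in the toy vocabulary, “running” maps to the unknown ID 0, so the model would learn nothing from it.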
A Smarter Approach: Subword Tokenization
To solve this, modern AI models like those from OpenAI and Google use a technique called subword tokenization. The core idea is that frequent words should remain whole, while rare words should be broken down into smaller, reusable parts. For example, a common word like “and” might be a single token. But a less common, more complex word like “tokenization” could be split into familiar subwords, such as “token” and “ization”; the exact split depends on the vocabulary the tokenizer has learned. This matters because even if the model has never seen the word “tokenization”, it has likely seen “token” and “ization” thousands of times in other contexts. By combining these known pieces, it can often make a good guess at the meaning of the new word.
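One simple way to picture the splitting is a greedy longest-match search over a known vocabulary. The sketch below is only an illustration under that assumption: the tiny vocabulary is invented, and production tokenizers apply their learned vocabularies with more elaborate rules.

```python
def subword_split(word, vocab):
    """Split a word into the longest known pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Try the longest candidate first, shrinking until it is in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts here: fall back to a single character.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical toy vocabulary, chosen to match the example in the text.
vocab = {"and", "token", "ization", "s"}

print(subword_split("and", vocab))           # ['and']
print(subword_split("tokenization", vocab))  # ['token', 'ization']
print(subword_split("tokens", vocab))        # ['token', 's']
```

The common word stays whole, while the rarer words are rebuilt from pieces the vocabulary already contains.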
How AI Learns the Subwords
These subword vocabularies are created using algorithms like Byte-Pair Encoding (BPE) or WordPiece. In BPE, the process starts by looking at a massive dataset of text and initializing the vocabulary with every individual character. Then, the algorithm repeatedly finds the most frequently occurring adjacent pair of tokens (at first, single characters) and merges them into a single new token. For example, if “e” and “s” appear together often, they are merged into a new subword token, “es”. This process continues, merging the most common pairs of existing tokens until the vocabulary reaches a predetermined size, commonly somewhere from tens of thousands to a few hundred thousand tokens. WordPiece works along similar lines but scores candidate merges differently, rather than using raw pair counts alone. The result is a compact dictionary of the most common letters, word fragments, and whole words.
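The merge loop described above can be sketched in a few lines of Python. This is a toy version of frequency-based BPE only (not WordPiece), and the word counts are invented for the example; real tokenizers train on far larger corpora and handle details such as word boundaries and raw bytes.

```python
from collections import Counter

def get_pair_counts(words):
    # Count every adjacent symbol pair, weighted by how often each word occurs.
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    # Rewrite every word, fusing each occurrence of `pair` into one symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    # Start with each word as a tuple of characters; the vocabulary is all characters.
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    vocab = sorted({ch for symbols in words for ch in symbols})
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break  # every word is already a single symbol
        best = max(counts, key=counts.get)  # most frequent adjacent pair
        words = merge_pair(words, best)
        merges.append(best)
        vocab.append(best[0] + best[1])
    return vocab, merges

# Hypothetical word frequencies standing in for a large corpus.
vocab, merges = train_bpe({"newest": 6, "widest": 3, "yes": 2}, num_merges=2)
print(merges)  # [('e', 's'), ('es', 't')]
print(vocab)   # ['d', 'e', 'i', 'n', 's', 't', 'w', 'y', 'es', 'est']
```

Here `num_merges` plays the role of the target vocabulary size: each merge adds exactly one new token, so training stops once the budget is used up.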
The Final Step: Turning Tokens into Numbers
Once the tokenizer has its vocabulary, the final step is simple: every unique token is assigned a fixed integer ID. For instance, “the” might become token 792, the subword “ing” might be 389, and a question mark might be 30. When you type a prompt, the text is first broken down into a sequence of tokens from this vocabulary. Then, that sequence of tokens is converted into a sequence of numbers. This list of numbers is the only thing the AI model ever actually sees. When the model generates a response, it works in reverse: it produces a sequence of token IDs, which are then decoded back into subwords and stitched together to form the human-readable text you see on your screen.
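The round trip from tokens to IDs and back can be sketched as a pair of dictionary lookups. The IDs for “the”, “ing”, and “?” reuse the illustrative numbers above; the “ walk” token (with its leading space) and its ID are additional assumptions made up for this example.

```python
# Hypothetical vocabulary: token string -> fixed integer ID.
token_to_id = {"the": 792, "ing": 389, "?": 30, " walk": 5000}
# Reverse mapping used for decoding.
id_to_token = {token_id: token for token, token_id in token_to_id.items()}

def encode(tokens):
    # Convert a sequence of tokens into the IDs the model actually receives.
    return [token_to_id[token] for token in tokens]

def decode(ids):
    # Look up each ID and stitch the pieces back together into text.
    return "".join(id_to_token[token_id] for token_id in ids)

tokens = ["the", " walk", "ing", "?"]  # assumed output of the splitting step
ids = encode(tokens)
print(ids)          # [792, 5000, 389, 30]
print(decode(ids))  # the walking?
```

Storing the space inside the “ walk” token is one way to make decoding a plain concatenation, so no separate rule is needed to decide where the spaces go.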