What is Tokenization?
Tokenization is the process of splitting text into smaller units called tokens - such as words, subwords, or characters - that serve as the basic inputs for natural language processing models.
Tokenization Explained
Tokenization is the first step in almost every NLP pipeline. Before a language model can process text, the text needs to be converted into a format the model can work with. Tokenization splits raw text into discrete units called tokens, and then those tokens are mapped to numerical IDs that the model actually processes.
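The two stages described above, splitting and then ID mapping, can be sketched in a few lines. This is only an illustration under simple assumptions: the whitespace split, the `build_vocab` and `encode` helpers, and the `-1` unknown ID are all hypothetical, not how any particular model does it.

```python
def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab, unk_id=-1):
    """Map tokens to IDs; tokens missing from the vocabulary get unk_id."""
    return [vocab.get(tok, unk_id) for tok in tokens]


text = "the cat sat on the mat"
tokens = text.split()        # naive whitespace tokenization (an assumption)
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)

print(tokens)  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(ids)     # [0, 1, 2, 3, 0, 4]
```

Note that both occurrences of 'the' share the ID 0: the model only ever sees the integer sequence, not the original strings.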
What constitutes a token depends on the tokenization scheme. Word tokenization splits text at spaces and punctuation: 'Hello, world!' becomes ['Hello', ',', 'world', '!']. Character tokenization treats each character as a token. Subword tokenization - used by most modern models including GPT and BERT - finds a middle ground, breaking uncommon words into meaningful pieces while keeping common words whole. The word 'tokenization' might become ['token', 'ization'].
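As a rough sketch of the first two schemes, word and character tokenization can each be written in one line of Python. The regular expression here is one simple choice among many, and the function names are made up for illustration.

```python
import re


def word_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def char_tokenize(text):
    """Treat every character, including spaces, as its own token."""
    return list(text)


print(word_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
print(char_tokenize("Hi!"))            # ['H', 'i', '!']
```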
Subword tokenization is particularly important for handling rare words, technical jargon, and misspellings. Instead of treating an unknown word as a single unknown token, the model can build up its meaning from familiar subword pieces. This dramatically improves a model's ability to handle text 'in the wild' with all its real-world variation.
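To make the idea concrete, here is a minimal sketch of subword splitting using greedy longest-match against a tiny vocabulary. The vocabulary `TOY_VOCAB`, the `[UNK]` marker, and the function name are assumptions for illustration; production tokenizers learn vocabularies of tens of thousands of pieces and differ in their details.

```python
def subword_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matches at this position, so give up on the whole word.
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical toy vocabulary of subword pieces.
TOY_VOCAB = {"token", "ization", "iz", "able", "un"}

print(subword_tokenize("tokenization", TOY_VOCAB))   # ['token', 'ization']
print(subword_tokenize("untokenizable", TOY_VOCAB))  # ['un', 'token', 'iz', 'able']
print(subword_tokenize("xyz", TOY_VOCAB))            # ['[UNK]']
```

The second call shows the benefit: 'untokenizable' is not in the vocabulary as a whole word, yet it still decomposes into known pieces instead of collapsing to a single unknown token.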
The concept of a token is also important for understanding the cost and capacity of language models. Most large language model APIs charge per token, and each model has a maximum context window measured in tokens. A rough rule of thumb for English text is that one token is approximately 0.75 words (so 100 tokens is about 75 words), though this varies by language and tokenization scheme.
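The rule of thumb translates into simple arithmetic. The sketch below is a back-of-the-envelope estimate only: the price passed in is a hypothetical placeholder, not a real rate, and an exact count requires running the model's own tokenizer.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for English text; approximate


def estimate_tokens(word_count):
    """Estimate the token count for a given number of English words."""
    return round(word_count / WORDS_PER_TOKEN)


def estimate_cost(word_count, price_per_1k_tokens):
    """Estimate cost given a (hypothetical) price per 1,000 tokens."""
    return estimate_tokens(word_count) / 1000 * price_per_1k_tokens


print(estimate_tokens(750))    # 1000
print(estimate_tokens(1500))   # 2000
print(estimate_cost(1500, price_per_1k_tokens=0.002))
```

The same estimate can be compared against a model's context window to judge whether a document is likely to fit before sending it.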
Tokenization choices can have subtle but significant effects on model performance. Languages with complex morphology (like Finnish or Turkish) require different tokenization strategies than English. Code requires tokenization that preserves syntax, such as indentation and operators. Multilingual models need tokenizers that handle dozens of languages fairly, a non-trivial engineering challenge in building globally useful language models.
Where is Tokenization Used?
Tokenization is used in every NLP pipeline and language model, from text preprocessing to API usage of large language models like GPT.
How Copilotly Uses Tokenization
Tokenization explains practical quirks Copilotly engineers around, like why the Translation Copilot's costs differ between languages and why code pastes consume context faster than prose. Each specialist's chunking strategy, from the Contract Reviewer to the Essay Copilot, is tuned to how its typical input tokenizes.
Frequently Asked Questions
What is the difference between tokenization and word embedding?
Tokenization is the first step: chopping raw text into discrete units and mapping each to an ID. Embedding is the second: converting each ID into a dense numerical vector that captures meaning. Tokenization decides what the pieces are; embeddings decide what the pieces mean to the model.
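A minimal sketch of the two steps side by side may help. The vocabulary, the 4-dimensional vectors, and the random initialization are all assumptions for illustration; in a real model the embedding table is learned during training and is far larger.

```python
import random


def make_embedding_table(vocab_size, dim, seed=0):
    """Build a table with one vector per token ID (random stand-in values)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]


vocab = {"token": 0, "ization": 1}               # hypothetical vocabulary
table = make_embedding_table(len(vocab), dim=4)

pieces = ["token", "ization"]
ids = [vocab[p] for p in pieces]       # tokenization: pieces -> IDs
vectors = [table[i] for i in ids]      # embedding: IDs -> dense vectors

print(ids)              # [0, 1]
print(len(vectors[0]))  # 4
```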
What is byte-pair encoding (BPE)?
BPE builds a vocabulary by starting from individual characters and repeatedly merging the most frequent adjacent pairs until reaching a target size, typically 30,000-200,000 entries. Frequent words end up as single tokens while rare words decompose into familiar subwords, letting a fixed vocabulary cover unlimited text.
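The merge loop can be sketched on a toy corpus. This is a simplified illustration, assuming a hypothetical word-frequency table and only three merges; real implementations handle word boundaries, bytes, and tie-breaking more carefully and run tens of thousands of merges.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges, starting from individual characters."""
    words = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words


# Hypothetical toy corpus: word -> frequency.
corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, words = train_bpe(corpus, num_merges=3)

print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(words)
```

After three merges the frequent word 'hug' has become a single symbol, while the rarer 'pug' is still split into 'p' and 'ug', which mirrors the frequent-versus-rare behavior described above.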
Why do some languages cost more tokens than English?
Tokenizer vocabularies are trained mostly on English-heavy corpora, so English compresses efficiently while languages like Thai, Hindi, or Khmer fragment into many more tokens per sentence, sometimes 3-5x. That inflates both API costs and effective context usage for non-English users.
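One contributing factor can be illustrated without any tokenizer at all. Assuming a byte-level tokenizer (one that starts from UTF-8 bytes, which is an assumption about the model in question), scripts outside the Latin range need several bytes per character, so text with few learned merges starts from a longer sequence. The sketch below counts bytes only; it does not measure real token counts.

```python
def utf8_profile(text):
    """Return (code points, UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))


english = "hello"
# Six Thai code points (a common greeting), written as escapes.
thai = "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35"

print(utf8_profile(english))  # (5, 5)
print(utf8_profile(thai))     # (6, 18)
```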
Can a model's tokenizer be changed after training?
Not easily. The model's embedding table and everything downstream were learned against one specific vocabulary, so swapping tokenizers invalidates the weights. Extending a vocabulary with new tokens is possible but requires additional training, which is why tokenizer choice is locked in at pretraining time.
