Subword Tokenization
Subword tokenization is a natural language processing technique that breaks text into units smaller than words, called subwords, such as prefixes, suffixes, or frequent character sequences, rather than whole words or individual characters. It handles out-of-vocabulary words by splitting them into known subword pieces, which improves model generalization across languages and domains. Common algorithms include Byte-Pair Encoding (BPE), WordPiece, and Unigram, and the approach underlies modern NLP models such as BERT (WordPiece) and GPT (BPE) by letting them cover diverse vocabularies with a fixed, modest token inventory.
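To make the splitting concrete, here is a minimal sketch of WordPiece-style greedy longest-match tokenization. The toy vocabulary, the "##" continuation prefix, and the helper name subword_tokenize are illustrative assumptions, not a real model's vocabulary or API.

```python
def subword_tokenize(word, vocab, unk="[UNK]"):
    """Greedily match the longest known subword piece at each position."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest candidate first, shrinking until a vocab hit.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark continuation pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no known piece covers this position
        pieces.append(piece)
        start = end
    return pieces

vocab = {"un", "break", "able", "##break", "##able", "##ing"}
print(subword_tokenize("unbreakable", vocab))  # ['un', '##break', '##able']
```

Even though "unbreakable" is not in the vocabulary as a whole word, it is recovered from known pieces rather than mapped to an unknown token, which is the core benefit described above.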
Developers should learn subword tokenization when building NLP applications that must handle rare words, multiple languages, or domain-specific terminology, because it reduces vocabulary size and improves model performance on unseen text. It is essential for tasks such as machine translation, text classification, and named entity recognition, where word-level tokenization breaks down on new or complex words. Typical use cases include training transformer models, processing social media text full of slang, and working with morphologically rich languages such as Finnish or Turkish.
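For the training use case, the sketch below shows one way to fit a small BPE tokenizer on domain-specific text, assuming the Hugging Face `tokenizers` package is installed; the corpus and vocab_size are placeholder values chosen for illustration.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Placeholder in-memory corpus; in practice this would be domain text.
corpus = [
    "Subword tokenization handles rare and domain-specific words.",
    "Morphologically rich languages benefit from subword units.",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Unseen or rare words are split into learned subword pieces
# instead of collapsing to the [UNK] token.
print(tokenizer.encode("tokenization").tokens)
```

The learned merges reflect the training text, so a tokenizer trained on, say, biomedical or legal documents will keep that domain's terminology intact as larger pieces while still covering everything else through smaller subwords.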