Bag of words is one of the simplest text representation methods. It is easier to understand when the word "bag" is read as a multiset (a collection in which repeated items are counted), not as an everyday physical bag.
Overview
Bag of words represents a document by counting which words appear and how often they appear.
It usually ignores grammar, sentence structure, and word order. The result is a vector of counts over a vocabulary.
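The description above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper name `bag_of_words` is hypothetical, lowercasing plus whitespace splitting stands in for a real tokenizer, and the vocabulary is sorted only so that the column order is reproducible.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, vectors) for a list of text documents."""
    # Assumption: lowercasing and whitespace splitting are enough to tokenize.
    tokenized = [doc.lower().split() for doc in documents]

    # The vocabulary is every unique token, sorted for a stable column order.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})

    # Each document becomes one row of counts, one column per vocabulary word.
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = bag_of_words(["the cat saw the dog", "the dog barked"])
print(vocabulary)  # ['barked', 'cat', 'dog', 'saw', 'the']
print(vectors)     # [[0, 1, 1, 1, 2], [1, 0, 1, 0, 1]]
```

Every vector has the same length as the vocabulary, so documents of different lengths become directly comparable rows.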
Simple example
Take two short documents. First build a vocabulary from the unique words. Then count how often each vocabulary word appears in each document.
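```python
# Two toy documents, already lowercase and separated by spaces.
documents = [
    "cat sat cat",
    "dog sat",
]

# Vocabulary: the unique words across both documents, in sorted order.
vocabulary = ["cat", "dog", "sat"]

# One count vector per document; column i counts vocabulary[i].
vectors = [
    [2, 0, 1],  # "cat sat cat": cat twice, sat once
    [0, 1, 1],  # "dog sat": dog once, sat once
]

print(vocabulary)
print(vectors)
```

What is kept and what is lost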
| Part | Kept or lost | Reason |
|---|---|---|
| Word identity | Kept | The vocabulary stores which words are known |
| Word count | Kept | The vector stores how often each word appears |
| Word order | Mostly lost | The vector does not preserve original positions |
| Grammar | Mostly lost | The representation is count-based, so relations between words are not recorded |
| Document length signal | Partly kept | Longer documents can have larger counts |
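Two rows of the table, word order and document length, can be checked with a small sketch. The helper `count_vector` and the example sentences are illustrative assumptions, and tokenization is again plain whitespace splitting.

```python
from collections import Counter


def count_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in a whitespace-split text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]


vocabulary = ["bites", "dog", "man"]

# Same words in a different order: the count vectors are identical.
a = count_vector("dog bites man", vocabulary)
b = count_vector("man bites dog", vocabulary)
print(a, b, a == b)  # [1, 1, 1] [1, 1, 1] True

# The same sentence repeated: every count doubles, so length leaks into the vector.
c = count_vector("dog bites man dog bites man", vocabulary)
print(c)  # [2, 2, 2]
```

The two sentences mean different things, yet nothing in the vectors distinguishes them, which is the practical meaning of word order being lost.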
Why bag means repeated entries
In this context, "bag" means that repeated items matter. If "cat" appears twice, the count for "cat" is 2.
This is different from a set. A set would only tell us that "cat" exists in the document, not that it appears twice.
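```python
from collections import Counter

document = "cat sat cat"
tokens = document.split()  # ['cat', 'sat', 'cat']

word_set = set(tokens)      # membership only: the repeated "cat" collapses
word_bag = Counter(tokens)  # multiset: each word is mapped to its count

print(word_set)  # contains 'cat' and 'sat'; display order may vary
print(word_bag)  # Counter({'cat': 2, 'sat': 1})
```

Common confusion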
- Bag of words is not a language model.
- Bag of words does not understand word meaning by itself.
- Bag of words is not the same as embeddings.
- Bag of words can still be useful as a baseline for classification, search, and simple text analysis.
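As one illustration of the baseline use in the last point, the sketch below ranks documents against a query by comparing their word counts. Cosine similarity is an assumption chosen here for the comparison, not something this article prescribes, and the corpus, query, and function name are made up for the example.

```python
import math
from collections import Counter


def cosine_similarity(bag_a, bag_b):
    """Cosine similarity between two Counter bags; 0.0 if either bag is empty."""
    # Counter returns 0 for missing words, so only shared words contribute.
    dot = sum(count * bag_b[word] for word, count in bag_a.items())
    norm_a = math.sqrt(sum(count * count for count in bag_a.values()))
    norm_b = math.sqrt(sum(count * count for count in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


corpus = ["cat sat cat", "dog sat"]
query = "cat"

bags = [Counter(doc.split()) for doc in corpus]
query_bag = Counter(query.split())

# Score every document against the query and pick the highest score.
scores = [cosine_similarity(query_bag, bag) for bag in bags]
best = max(range(len(corpus)), key=lambda i: scores[i])
print(corpus[best])  # cat sat cat
```

The ranking relies only on overlapping word counts, so a query for "kitten" would score 0 against both documents. That is the limitation the earlier points describe.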
Related terms
- Token
- Vocabulary
- Corpus
- Document
- Term frequency
- Sparse vector
- TF-IDF