Bag of Words | MLNotebooks Glossary

Bag of words is one of the simplest text representation methods. It is easier to understand when "bag" is read in the mathematical sense of a multiset, a collection that keeps duplicates, rather than as an everyday physical bag.

Overview

Bag of words represents a document by counting which words appear and how often they appear.

It usually ignores grammar, sentence structure, and word order. The result is a vector of counts over a vocabulary.

Simple example

Take two short documents. First build a vocabulary from the unique words. Then count how often each vocabulary word appears in each document.

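documents = [
    "cat sat cat",
    "dog sat",
]

# Unique words across both documents, in a fixed order.
vocabulary = ["cat", "dog", "sat"]

# One row per document, one column per vocabulary word.
vectors = [
    [2, 0, 1],  # "cat sat cat": cat=2, dog=0, sat=1
    [0, 1, 1],  # "dog sat":     cat=0, dog=1, sat=1
]

print(vocabulary)
print(vectors)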
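
The vocabulary and vectors above are written out by hand. As a minimal sketch, the same result can be computed from the documents. It assumes whitespace tokenization and an alphabetically sorted vocabulary, which are illustrative choices, and the helper name build_bag_of_words is hypothetical.

```python
from collections import Counter


def build_bag_of_words(documents):
    """Return (vocabulary, vectors) for a list of strings.

    Assumes tokens are separated by whitespace; real pipelines usually
    also lowercase the text and strip punctuation.
    """
    tokenized = [doc.split() for doc in documents]
    # Sorted so that each column index maps to the same word on every run.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # A Counter returns 0 for words that are absent from the document.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = build_bag_of_words(["cat sat cat", "dog sat"])
print(vocabulary)  # ['cat', 'dog', 'sat']
print(vectors)     # [[2, 0, 1], [0, 1, 1]]
```

Sorting the vocabulary is only one way to fix the column order; any order works as long as it is the same for every document.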

What is kept and what is lost

Part	Kept or lost	Reason
Word identity	Kept	The vocabulary stores which words are known
Word count	Kept	The vector stores how often each word appears
Word order	Lost	The vector does not record token positions
Grammar	Mostly lost	The representation is count based
Document length signal	Partly kept	Longer documents can have larger counts
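
The table can be checked with a small sketch. The sentences and the helper name to_vector are made up for illustration, and tokens are assumed to be separated by whitespace. Two sentences with the same words in a different order get the same vector, while repeating a sentence doubles every count.

```python
from collections import Counter

vocabulary = ["bites", "dog", "man"]


def to_vector(document):
    """Count each vocabulary word in a whitespace-tokenized document."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]


first = to_vector("dog bites man")
second = to_vector("man bites dog")
doubled = to_vector("dog bites man dog bites man")

print(first)            # [1, 1, 1]
print(first == second)  # True: the order of the words is not recorded
print(doubled)          # [2, 2, 2]: a longer document gives larger counts
```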

Why bag means repeated entries

In this context, bag means that repeated items matter. If cat appears twice, the count for cat is 2.

This is different from a set. A set would only tell us that cat exists in the document, not that it appears twice.

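from collections import Counter

document = "cat sat cat"
tokens = document.split()

# A set keeps each distinct word once.
word_set = set(tokens)
# A Counter acts as a multiset: it keeps each word with its count.
word_bag = Counter(tokens)

# Sets are unordered, so sort for a stable display.
print(sorted(word_set))  # ['cat', 'sat']
print(word_bag)          # Counter({'cat': 2, 'sat': 1})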

Common confusion

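Bag of words is not a language model: it does not predict or generate text.
Bag of words does not capture word meaning by itself. Each distinct word gets its own column, so related words such as cat and kitten are treated as unrelated.
Bag of words is not the same as embeddings. Its vectors hold word counts, not learned dense values.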
Bag of words can still be useful as a baseline for classification, search, and simple text analysis.
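
As a rough sketch of the search use case, documents can be ranked against a query by comparing their bags. Cosine similarity is used here as one common choice, not the only one. The helper name cosine_similarity and the one-word query are assumptions made for this example.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two Counter bags; 0.0 if either is empty."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


documents = ["cat sat cat", "dog sat"]
bags = [Counter(doc.split()) for doc in documents]
query = Counter("cat".split())

# Score every document against the query and pick the highest score.
scores = [cosine_similarity(query, bag) for bag in bags]
best = max(range(len(documents)), key=lambda i: scores[i])
print(documents[best])  # cat sat cat
```

The second document scores 0.0 because it shares no words with the query, which also shows the limit of this baseline: it only matches exact tokens.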

Related terms

Token
Vocabulary
Corpus
Document
Term frequency
Sparse vector
TF-IDF