Bag of Words

A text representation that keeps word counts while ignoring most word order.

Bag of words is one of the simplest text representation methods. The name is easier to understand when "bag" is read as a multiset (a collection in which repeated items are counted), not as an everyday physical bag.

Overview

Bag of words represents a document by recording which words appear and how often each one appears.

It usually ignores grammar, sentence structure, and word order. The result is a vector of counts over a vocabulary.

Simple example

Take two short documents. First build a vocabulary from the unique words. Then count how often each vocabulary word appears in each document.

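# Two toy documents, already lowercased and free of punctuation
documents = [
    "cat sat cat",
    "dog sat",
]

# Unique words across both documents, in sorted order
vocabulary = ["cat", "dog", "sat"]

# One row per document; column i holds the count of vocabulary[i]
vectors = [
    [2, 0, 1],  # "cat sat cat": cat x2, sat x1
    [0, 1, 1],  # "dog sat": dog x1, sat x1
]

print(vocabulary)
print(vectors)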
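
The vocabulary and vectors in this example are written out by hand. The following is a minimal sketch of how they could be computed instead, assuming whitespace tokenization and an alphabetically sorted vocabulary. The helper name `bag_of_words` is hypothetical and not part of any library.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, vectors) for a list of strings.

    Assumption: tokens are separated by whitespace. Real tokenizers
    usually also handle case and punctuation.
    """
    tokenized = [doc.split() for doc in documents]

    # Vocabulary: every unique token, sorted so column order is stable
    vocabulary = sorted({tok for toks in tokenized for tok in toks})

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # Counter returns 0 for words this document does not contain
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = bag_of_words(["cat sat cat", "dog sat"])
print(vocabulary)
print(vectors)
```

For the two example documents, this produces the same vocabulary and count vectors as the hand-written version.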

What is kept and what is lost

| Part | Kept or lost | Reason |
| --- | --- | --- |
| Word identity | Kept | The vocabulary stores which words are known |
| Word count | Kept | The vector stores how often each word appears |
| Word order | Mostly lost | The vector does not preserve original positions |
| Grammar | Mostly lost | The representation is count based |
| Document length signal | Partly kept | Longer documents can have larger counts |
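
A small sketch can make the loss of word order concrete. The two sentences below are illustrative assumptions, not taken from any dataset: they mean different things but contain the same words the same number of times, so their bags are identical.

```python
from collections import Counter

# Same words, different order, different meaning
first = Counter("dog bites man".split())
second = Counter("man bites dog".split())

# Counters compare equal when every word has the same count
same_counts = first == second
print(same_counts)

# Total count grows with document length, so some length signal survives
short_total = sum(Counter("cat sat".split()).values())
long_total = sum(Counter("cat sat cat sat cat".split()).values())
print(short_total, long_total)
```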

Why bag means repeated entries

In this context, bag means that repeated items matter. If cat appears twice, the count for cat is 2.

This is different from a set. A set would only tell us that cat exists in the document, not that it appears twice.

from collections import Counter

document = "cat sat cat"
tokens = document.split()  # ["cat", "sat", "cat"]

word_set = set(tokens)      # membership only: cat and sat are present
word_bag = Counter(tokens)  # multiset: cat -> 2, sat -> 1

print(word_set)
print(word_bag)

Common confusion

  • Bag of words is not a language model.
  • Bag of words does not understand word meaning by itself.
  • Bag of words is not the same as embeddings.
  • Bag of words can still be useful as a baseline for classification, search, and simple text analysis.
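
As a rough illustration of the baseline use for search, the sketch below ranks the two example documents against a one-word query. Cosine similarity is an assumption chosen here for illustration, since the text does not prescribe a scoring method, and the helper name `cosine` is hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count bags (Counters)."""
    # Only words present in both bags contribute to the dot product
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = ["cat sat cat", "dog sat"]
bags = [Counter(doc.split()) for doc in docs]
query = Counter("cat".split())

scores = [cosine(query, bag) for bag in bags]
best = max(range(len(docs)), key=lambda i: scores[i])
print(docs[best])
```

The match is purely lexical: a query for "kitten" would score zero against both documents, which is the sense in which bag of words does not understand word meaning by itself.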

Related terms

  • Token
  • Vocabulary
  • Corpus
  • Document
  • Term frequency
  • Sparse vector
  • TF-IDF