Find most common multi words in an input file in Python

Question

Say I have a text file. I can find the most frequent words easily using Counter. However, I would also like to find multi-word phrases such as "tax year", "fly fishing", or "u.s. capitol", that is, the words that occur together most often.

import re
from collections import Counter

with open('full.txt') as f:
    passage = f.read()

words = re.findall(r'\w+', passage)

cap_words = [word for word in words]

word_counts = Counter(cap_words)

for k, v in word_counts.most_common():
    print(k, v)

This is what I have so far, but it only counts single words. How do I find phrases of multiple words?

charles · Accepted Answer

What you're looking for is a way to count bigrams (pairs of adjacent words).

The nltk library is great for lots of language-related tasks, and you can use Counter from collections for all your counting-related activities!

import nltk
from nltk import bigrams
from collections import Counter

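# passage is the text read from the file in the question.
# word_tokenize needs NLTK's tokenizer data, fetched once with nltk.download().
tokens = nltk.word_tokenize(passage)
print(Counter(bigrams(tokens)))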
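
If installing nltk is not an option, the same bigram count can be sketched with the standard library alone. This is only a minimal illustration, not part of the answer above: the helper name most_common_ngrams, its parameters, and the sample text are made up for this example. It reuses the question's \w+ tokenizer, which means a phrase like "u.s. capitol" would be split into "u", "s", "capitol".

```python
import re
from collections import Counter

def most_common_ngrams(text, n=2, top=10):
    """Return the `top` most frequent runs of n adjacent words (hypothetical helper)."""
    # Lowercase so "Fly fishing" and "fly fishing" are counted together.
    words = re.findall(r"\w+", text.lower())
    # Zip n staggered views of the word list to form adjacent n-word tuples.
    ngrams = zip(*(words[i:] for i in range(n)))
    return Counter(ngrams).most_common(top)

# Made-up sample text, standing in for the contents of the input file.
sample = "Fly fishing is fun. Fly fishing needs patience. The tax year ends soon."
for phrase, count in most_common_ngrams(sample, n=2, top=3):
    print(" ".join(phrase), count)
```

Passing n=3 counts three-word phrases in the same way. Note that this sketch pairs words across sentence boundaries, so filtering by sentence or removing stop words may be worth adding for real text.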
