DSMK Swab

Reputation: 181

Find the most common multi-word phrases in an input file in Python

Say I have a text file. I can easily find the most frequent words using Counter. However, I would also like to find multi-word phrases such as "tax year", "fly fishing", "u.s. capitol", etc., that is, the words that occur together most often.

import re
from collections import Counter

with open('full.txt') as f:
    passage = f.read()

words = re.findall(r'\w+', passage)

cap_words = [word for word in words]

word_counts = Counter(cap_words)

for k, v in word_counts.most_common():
    print(k, v)

This is what I have currently; however, it only finds single words. How do I find multi-word phrases?

Upvotes: 0

Views: 95

Answers (2)

charles

Reputation: 1055

What you're looking for is a way to count bigrams (strings containing two words).

The nltk library is great for lots of language-related tasks, and you can use Counter from collections for the counting.

import nltk
from nltk import bigrams
from collections import Counter

# passage is the file contents read in the question's code
tokens = nltk.word_tokenize(passage)
# Count every pair of adjacent tokens
print(Counter(bigrams(tokens)))

Upvotes: 3

DYZ

Reputation: 57033

What you call multiwords (there is no such thing) are actually called bigrams. You can get a list of bigrams from a list of words by zipping the list with a copy of itself shifted by one position:

bigrams = [f"{x} {y}" for x, y in zip(words, words[1:])]
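To tie this back to the question, here is a minimal sketch that feeds the zipped pairs straight into Counter, using only the standard library. The function name most_common_bigrams, the sample string, and the lowercasing step are assumptions added for illustration; they are not part of the original code. Note that the r"\w+" pattern from the question splits "u.s." into "u" and "s", so a phrase like "u.s. capitol" would need a different tokenizer.

```python
import re
from collections import Counter


def most_common_bigrams(text, n=10):
    """Return the n most frequent pairs of adjacent words in text."""
    # Lowercasing is an assumption, so "Tax year" and "tax year" count together.
    words = re.findall(r"\w+", text.lower())
    # Pair each word with its successor: (w0, w1), (w1, w2), ...
    pairs = zip(words, words[1:])
    return Counter(f"{x} {y}" for x, y in pairs).most_common(n)


# Hypothetical sample text standing in for the contents of full.txt
sample = "tax year ends soon and the tax year starts again after the tax year"
top = most_common_bigrams(sample, 3)
for phrase, count in top:
    print(phrase, count)
```

With the file from the question, you would call most_common_bigrams(passage) instead of passing the sample string.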

P.S. NLTK would indeed be a better tool for getting bigrams.

Upvotes: 0
