Reputation: 181
Say I have a text file, I can find the most frequent words easily using Counter. However, I would also like to find multi words like "tax year, fly fishing, u.s. capitol, etc.". Words that occur together the most.
import re
from collections import Counter
with open('full.txt') as f:
passage = f.read()
words = re.findall(r'\w+', passage)
cap_words = [word for word in words]
word_counts = Counter(cap_words)
for k, v in word_counts.most_common():
print(k, v)
I have this currently, however, this only find one word. How do I find multiple words?
Upvotes: 0
Views: 95
Reputation: 1055
What you're looking for is a way to count bigrams (strings containing two words).
The nltk library is great for doing lots of language related tasks, and you can use Counter from collections for all your counting-related activities!
import nltk
from nltk import bigrams
from collections import Counter
tokens = nltk.word_tokenize(passage)
print(Counter(bigrams(tokens))
Upvotes: 3
Reputation: 57033
What you call mutliwords (there is no such thing) is actually called bigrams. You can get a list of bigrams from a list of words by zipping it with itself with a displacement:
bigrams = [f"{x} {y}" for x,y, in zip(words, words[1:])]
P.S. NLTK would be indeed a better tool to get bigrams.
Upvotes: 0