Reputation: 11
I have written a Python program to find word frequencies. I am stuck at the point where I need to find the frequency of bigrams without regard to word order, so that "in the" is counted the same as "the in". Here is my code to find bigram frequency:
import nltk
from itertools import combinations
from nltk.collocations import BigramCollocationFinder
txt = open('txt file', 'r')
finder1 = BigramCollocationFinder.from_words(txt.read().split(), window_size=3)
finder1.apply_freq_filter(3)
bigram_measures = nltk.collocations.BigramAssocMeasures()
for k, v in sorted(list(combinations(set(finder1.ngram_fd.items()), 2)), key=lambda t: t[-1], reverse=True)[:10]:
    print(k, v)
Upvotes: 1
Views: 340
Reputation: 1922
This seems like a case where you could use sets as the keys in a Counter. Sets are unordered containers, and a Counter is a dictionary specialized for counting occurrences of objects in an iterable. Because dictionary keys must be hashable, the keys have to be frozensets rather than plain sets. It could look something like this:
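from collections import Counter
from string import punctuation as punct
with open('txt file.txt') as txt:
    # Strip punctuation, then split on whitespace into a list of words
    doc = txt.read().translate(str.maketrans('', '', punct)).split()
c = Counter()
# frozenset makes each adjacent pair hashable and order-independent
c.update(frozenset((doc[i], doc[i+1])) for i in range(len(doc) - 1))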
The with statement opens the file and automatically closes it when the block ends. Inside the block, the text is read, stripped of punctuation, and split on whitespace (spaces, newlines, etc.) into a list of words. The code then initializes the Counter and counts unordered pairs of adjacent words.
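One caveat with frozenset keys is that a repeated word such as "the the" collapses to a one-element set. As a variation on the same idea (not the approach above), here is a minimal sketch that uses sorted tuples as keys instead, so every key still has two elements. The function name unordered_bigram_counts, the sample sentence, and the lowercasing step are assumptions made for illustration; drop the lowercasing if case matters for your text.

```python
from collections import Counter
from string import punctuation


def unordered_bigram_counts(text):
    """Count adjacent word pairs, treating (a, b) and (b, a) as the same pair."""
    # Assumption: strip ASCII punctuation and lowercase before splitting on whitespace.
    words = text.translate(str.maketrans('', '', punctuation)).lower().split()
    # zip(words, words[1:]) yields each adjacent pair; sorting the pair
    # makes the key independent of the original word order.
    return Counter(tuple(sorted(pair)) for pair in zip(words, words[1:]))


# Hypothetical sample text, only to show the call.
sample = "in the house the in crowd is in the house"
counts = unordered_bigram_counts(sample)

# Print the most frequent unordered bigrams.
for pair, n in counts.most_common(3):
    print(pair, n)
```

To mimic apply_freq_filter(3) from the question, the result can be filtered afterwards, for example with {pair: n for pair, n in counts.items() if n >= 3}.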
Upvotes: 1