Reputation: 71
I would like to analyze a text by counting bigrams. Unfortunately, my text contains many repeated words (for example: hello hello) that I don't want to be counted as bigrams.
My code is the following:
import nltk
b = nltk.collocations.BigramCollocationFinder.from_words('this this is is a a test test'.split())
b.ngram_fd.keys()
that returns:
>> dict_keys([('this', 'this'), ('this', 'is'), ('is', 'is'), ('is', 'a'), ('a', 'a'), ('a', 'test'), ('test', 'test')])
but I would like the output to be:
>> [('a', 'test'), ('is', 'a'), ('this', 'is')]
Do you have any suggestions? Solutions using a different library are welcome too. Thank you in advance for any help. Francesca
Upvotes: 1
Views: 666
Reputation: 763
Try:
result_cleared = [x for x in b.ngram_fd.keys() if x[0] != x[1]]
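Since the question allows other libraries, here is a minimal sketch of the same filter using only the standard library, assuming you just need the bigrams and their counts rather than NLTK's collocation scoring. The function name count_bigrams_skip_repeats is made up for illustration.

```python
from collections import Counter

def count_bigrams_skip_repeats(words):
    """Count adjacent word pairs, skipping pairs whose two words are identical."""
    pairs = zip(words, words[1:])  # every adjacent (word, next_word) pair
    return Counter(pair for pair in pairs if pair[0] != pair[1])

words = 'this this is is a a test test'.split()
counts = count_bigrams_skip_repeats(words)
print(sorted(counts))  # [('a', 'test'), ('is', 'a'), ('this', 'is')]
```

The Counter keeps the frequency of each bigram, so counts[('this', 'is')] gives how often that pair occurred.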
Edit: If your texts are stored in a DataFrame, you can do the following:
import nltk
import pandas as pd
# the dummy data from your comment
df = pd.DataFrame({'Text': ['this is a stupid text with no no no sense', 'this song says na na na', 'this is very very very very annoying']})
def create_bigrams(text):
    # build the bigrams for one text and drop pairs of identical words
    b = nltk.collocations.BigramCollocationFinder.from_words(text.split())
    return [x for x in b.ngram_fd.keys() if x[0] != x[1]]
df["bigrams"] = df["Text"].apply(create_bigrams)
df["bigrams"].apply(print)
This first adds a column containing the bigrams to the DataFrame and then prints the column values. If you only want the output without modifying df, replace the last two lines with:
df["Text"].apply(create_bigrams).apply(print)
Upvotes: 2
Reputation: 6359
You could remove consecutive duplicate words before passing the list to nltk.collocations.BigramCollocationFinder.from_words:
words = 'this this is is a a test test'.split()
removed_duplicates = [first for first, second in zip(words, ['']+words) if first != second]
removed_duplicates is then:
['this', 'is', 'a', 'test']
and then do:
b = nltk.collocations.BigramCollocationFinder.from_words(removed_duplicates)
b.ngram_fd.keys()
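As an alternative to the zip trick for collapsing repeats, itertools.groupby from the standard library groups runs of equal consecutive items, so taking one key per run gives the same list. This is only a sketch of the deduplication step; the variable name collapsed is illustrative.

```python
from itertools import groupby

words = 'this this is is a a test test'.split()

# groupby yields one (key, run) pair per run of equal consecutive words;
# keeping only the key collapses "this this" into a single "this".
collapsed = [word for word, _run in groupby(words)]
print(collapsed)  # ['this', 'is', 'a', 'test']
```

The resulting list can be passed to from_words in the same way as removed_duplicates. Note that only adjacent repeats are collapsed; a word that reappears later in the text is kept.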
Upvotes: 1