botti23

Reputation: 71

Bigram without repeated words

I would like to analyze a text by counting bigrams. Unfortunately, my text contains many consecutively repeated words (for example: hello hello), and I don't want those pairs to be counted as bigrams.

My code is the following:

import nltk

b = nltk.collocations.BigramCollocationFinder.from_words('this this is is a a test test'.split())
b.ngram_fd.keys()

that returns:

>> dict_keys([('this', 'this'), ('this', 'is'), ('is', 'is'), ('is', 'a'), ('a', 'a'), ('a', 'test'), ('test', 'test')])

but I would like the output to be:

>> [('a', 'test'), ('is', 'a'), ('this', 'is')]

Do you have any suggestions? Solutions using a different library are welcome too. Thank you in advance for any help. Francesca

Upvotes: 1

Views: 666

Answers (2)

Georgy Kopshteyn

Reputation: 763

Try filtering out the bigrams whose two words are identical:

result_cleared = [x for x in b.ngram_fd.keys() if x[0] != x[1]]
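
For comparison, the same filter can be sketched with only the standard library, which also keeps the counts and makes it easy to get the sorted list shown in the question. This is a minimal sketch rather than a replacement for the NLTK finder; the helper name count_bigrams is made up for illustration.

```python
from collections import Counter

def count_bigrams(words):
    """Count adjacent word pairs, skipping pairs made of the same word twice."""
    pairs = zip(words, words[1:])
    return Counter(pair for pair in pairs if pair[0] != pair[1])

words = 'this this is is a a test test'.split()
counts = count_bigrams(words)

# Sorting the Counter sorts its keys, i.e. the distinct bigrams.
print(sorted(counts))
```

Here sorted(counts) gives [('a', 'test'), ('is', 'a'), ('this', 'is')], and counts[('this', 'is')] is the number of times that pair occurs.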

Edit: If your texts are stored in a DataFrame, you can do the following:

import nltk
import pandas as pd

# the dummy data from your comment
df = pd.DataFrame({'Text': ['this is a stupid text with no no no sense',
                            'this song says na na na',
                            'this is very very very very annoying']})

def create_bigrams(text):
    b = nltk.collocations.BigramCollocationFinder.from_words(text.split())
    return [x for x in b.ngram_fd.keys() if x[0] != x[1]]

df["bigrams"] = df["Text"].apply(create_bigrams)
df["bigrams"].apply(print)

This first adds a column containing the bigrams to the DataFrame and then prints the column values. If you want only the output without manipulating df, replace the last two lines with:

df["Text"].apply(create_bigrams).apply(print)

Upvotes: 2

Tom McLean

Reputation: 6359

You could remove consecutive duplicate words before passing the list to nltk.collocations.BigramCollocationFinder.from_words:

words = 'this this is is a a test test'.split()
removed_duplicates = [first for first, second in zip(words, ['']+words) if first != second]

Output:

['this', 'is', 'a', 'test']

and then do:

b = nltk.collocations.BigramCollocationFinder.from_words(removed_duplicates)
b.ngram_fd.keys()
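
Another way to collapse runs of repeated words is itertools.groupby, which groups consecutive equal items. The following is only a sketch; collapse_repeats is a hypothetical helper name, and the bigrams are built with zip here so that the example runs without NLTK.

```python
from itertools import groupby

def collapse_repeats(words):
    """Keep one word from each run of consecutive identical words."""
    return [word for word, _run in groupby(words)]

words = 'this song says na na na'.split()
collapsed = collapse_repeats(words)

# Pair each word with its successor to form the bigrams.
bigrams = list(zip(collapsed, collapsed[1:]))
print(collapsed)
print(bigrams)
```

The collapsed list can be passed to from_words in the same way as removed_duplicates. Only adjacent repeats are dropped, so a word that reappears later in the text is kept.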

Upvotes: 1
