Reputation: 417
I'm trying to create bigrams using nltk that don't cross sentence boundaries. I tried using from_documents; however, it isn't working as I had hoped.
import nltk
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_documents([['This', 'is', 'sentence', 'one'], ['A', 'second', 'sentence']])
print finder.nbest(bigram_measures.pmi, 10)
>> [(u'A', u'second'), (u'This', u'is'), (u'one', u'A'), (u'is', u'sentence'), (u'second', u'sentence'), (u'sentence', u'one')]
This includes (u'one', u'A'), which I'm trying to avoid.
Upvotes: 3
Views: 486
Reputation: 417
I ended up ditching nltk and doing the processing by hand:
To create ngrams, I found this handy function on
def find_ngrams(input_list, n):
    return zip(*[input_list[i:] for i in range(n)])
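As a quick check of what this helper produces, here is a minimal sketch on a single tokenized sentence taken from the question. The list(...) wrapper is an addition, on the assumption that the code may be run under Python 3, where zip returns an iterator rather than a list.

```python
def find_ngrams(input_list, n):
    # Zip n staggered copies of the list; zip stops at the shortest copy,
    # so no n-gram runs past the end of the sentence.
    return list(zip(*[input_list[i:] for i in range(n)]))

sentence = ['This', 'is', 'sentence', 'one']
bigrams = find_ngrams(sentence, 2)
print(bigrams)
# [('This', 'is'), ('is', 'sentence'), ('sentence', 'one')]
```

Because the function only ever sees one sentence at a time, it cannot pair the last word of one sentence with the first word of the next.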
From there, I computed bigram probabilities doing the following:
First I created the bigrams
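# Flatten into a single list of bigrams; each sentence is processed separately
all_bigrams = [bigram for sentence in text for bigram in find_ngrams(sentence, 2)]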
Then I grouped them by first word
first_words = {}
for bigram in all_bigrams:
    if bigram[0] in first_words:
        first_words[bigram[0]].append(bigram)
    else:
        first_words[bigram[0]] = [bigram]
I then computed the probabilities for each bigram
bi_probabilites = {}
for bigram in set(all_bigrams):
    bigram_count = 0
    first_word_list = first_words[bigram[0]]
    for item in first_word_list:
        if item == bigram:
            bigram_count += 1
    bi_probabilites[bigram] = {
        'count': bigram_count,
        'length': len(first_word_list),
        'prob': float(bigram_count)/len(first_word_list)
    }
Not the most elegant, but it gets the job done.
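For comparison, the same counts can probably be computed more compactly with collections.Counter from the standard library. This is only a sketch under a few assumptions: text is a list of tokenized sentences (the two sentences from the question are used here), bigram_probabilities is a hypothetical helper name, and 'prob' is the count of the bigram divided by the number of bigrams that start with the same first word, as in the code above.

```python
from collections import Counter

def bigram_probabilities(text):
    """Estimate P(w2 | w1) from bigrams that stay within each sentence."""
    bigram_counts = Counter()
    first_word_counts = Counter()
    for sentence in text:
        # Pair each token with its successor in the same sentence only.
        for bigram in zip(sentence, sentence[1:]):
            bigram_counts[bigram] += 1
            first_word_counts[bigram[0]] += 1
    return {
        bigram: {
            'count': count,
            'length': first_word_counts[bigram[0]],
            'prob': float(count) / first_word_counts[bigram[0]],
        }
        for bigram, count in bigram_counts.items()
    }

# Sample input from the question (assumed to be already tokenized).
text = [['This', 'is', 'sentence', 'one'], ['A', 'second', 'sentence']]
probs = bigram_probabilities(text)
print(('one', 'A') in probs)               # False: no bigram spans the boundary
print(probs[('sentence', 'one')]['prob'])  # 1.0
```

Counting in a single pass avoids rescanning the grouped list for every distinct bigram, which may matter on larger corpora.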
Upvotes: 2