OverflowingTheGlass

Reputation: 2434

Compute nGrams across a list of lists of sentences using nltk

I have a list of lists where each internal list is a sentence that is tokenized into words:

sentences = [['farmer', 'plants', 'grain'],
             ['fisher', 'catches', 'tuna'],
             ['police', 'officer', 'fights', 'crime']]

Currently I am attempting to compute the nGrams like so:

numSentences = len(sentences)
nGrams = []
for i in range(0, numSentences):
    nGrams.append(list(ngrams(sentences, 2)))

This finds bigrams of whole sentences across the outer list rather than bigrams of the words within each internal list (and the result is repeated once per sentence, which is expected given the loop):

[[(['farmer', 'plants', 'grain'], ['fisher', 'catches', 'tuna']),
  (['fisher', 'catches', 'tuna'], ['police', 'officer', 'fights', 'crime'])],
 [(['farmer', 'plants', 'grain'], ['fisher', 'catches', 'tuna']),
  (['fisher', 'catches', 'tuna'], ['police', 'officer', 'fights', 'crime'])],
 [(['farmer', 'plants', 'grain'], ['fisher', 'catches', 'tuna']),
  (['fisher', 'catches', 'tuna'], ['police', 'officer', 'fights', 'crime'])]]

How do I compute the nGrams of each sentence (by word)? In other words, how do I ensure the nGrams don't span multiple list items? Here is my desired output:

farmer plants
plants grain
fisher catches
catches tuna
police officer
officer fights
fights crime

Upvotes: 0

Views: 3009

Answers (3)

titipata

Reputation: 5389

As an alternative, you can consider scikit-learn's CountVectorizer.

from sklearn.feature_extraction.text import CountVectorizer

sents = list(map(lambda x: ' '.join(x), sentences)) # join each tokenized sentence back into a string, since CountVectorizer expects raw text
count_vect = CountVectorizer(ngram_range=(2,2)) # bigram
count_vect.fit(sents)
count_vect.vocabulary_

This will give you:

{'catches tuna': 0,
 'farmer plants': 1,
 'fights crime': 2,
 'fisher catches': 3,
 'officer fights': 4,
 'plants grain': 5,
 'police officer': 6}
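If you also want to know how often each bigram occurs, not just its index in the vocabulary, a minimal sketch along these lines should work (assuming the sents and count_vect from above; get_feature_names_out is the method name on recent scikit-learn versions, older versions use get_feature_names):

import numpy as np

X = count_vect.transform(sents)              # documents x bigrams sparse matrix
totals = np.asarray(X.sum(axis=0)).ravel()   # total occurrences of each bigram
print(dict(zip(count_vect.get_feature_names_out(), totals)))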

Upvotes: 1

alexis

Reputation: 50190

Take the ngrams of each sentence and combine the results. You probably want to count them, not keep them in a huge collection. Starting with sentences as a list of lists of words:

import collections
import nltk

counts = collections.Counter()   # or nltk.FreqDist()
for sent in sentences:
    counts.update(nltk.ngrams(sent, 2))

Or, if you prefer a single string rather than a tuple as your key:

for sent in sentences:
    counts.update(" ".join(n) for n in nltk.ngrams(sent, 2))

That's really all there is to it. Then you can see the most common ones, etc.

print(counts.most_common(10))
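To get the question's space-joined output, a small sketch using the tuple-keyed counts from the first snippet above:

for bigram in counts:
    print(' '.join(bigram))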

PS. If you really wanted to pile up the bigrams, you'd do it like this. (Your code forms "bigrams" of sentences, not words, because you neglected to write sentences[i].) But skip this step and just count them directly.

all_ngrams = []
for sent in sentences:
    all_ngrams.extend(nltk.ngrams(sent, 2))
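For completeness, that flat list can then be counted in one step (same all_ngrams and collections import as above), which gets you back to the counting approach:

counts = collections.Counter(all_ngrams)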

Upvotes: 1

alvas

Reputation: 122052

Use a list comprehension and chain to flatten the list:

>>> from itertools import chain
>>> from collections import Counter
>>> from nltk import ngrams

>>> x = [['farmer', 'plants', 'grain'], ['fisher', 'catches', 'tuna'], ['police', 'officer', 'fights', 'crime']]

>>> Counter(chain(*[ngrams(sent,2) for sent in x]))
Counter({('plants', 'grain'): 1, ('police', 'officer'): 1, ('farmer', 'plants'): 1, ('officer', 'fights'): 1, ('fisher', 'catches'): 1, ('fights', 'crime'): 1, ('catches', 'tuna'): 1})

>>> c = Counter(chain(*[ngrams(sent,2) for sent in x]))

Get the keys of the Counter dictionary:

>>> c.keys()
[('plants', 'grain'), ('police', 'officer'), ('farmer', 'plants'), ('officer', 'fights'), ('fisher', 'catches'), ('fights', 'crime'), ('catches', 'tuna')]

Join the strings with spaces:

>>> [' '.join(b) for b in c.keys()]
['plants grain', 'police officer', 'farmer plants', 'officer fights', 'fisher catches', 'fights crime', 'catches tuna']

Upvotes: 0
