Reputation: 2434
I have a list of lists where each internal list is a sentence that is tokenized into words:
sentences = [['farmer', 'plants', 'grain'],
             ['fisher', 'catches', 'tuna'],
             ['police', 'officer', 'fights', 'crime']]
Currently I am attempting to compute the nGrams like so:
numSentences = len(sentences)
nGrams = []
for i in range(0, numSentences):
    nGrams.append(list(ngrams(sentences, 2)))
This results in finding bigrams of the whole list of sentences rather than of the individual words within each internal list (and the result is repeated once per sentence, which is predictable given the loop):
[[(['farmer', 'plants', 'grain'], ['fisher', 'catches', 'tuna']),
  (['fisher', 'catches', 'tuna'], ['police', 'officer', 'fights', 'crime'])],
 [(['farmer', 'plants', 'grain'], ['fisher', 'catches', 'tuna']),
  (['fisher', 'catches', 'tuna'], ['police', 'officer', 'fights', 'crime'])],
 [(['farmer', 'plants', 'grain'], ['fisher', 'catches', 'tuna']),
  (['fisher', 'catches', 'tuna'], ['police', 'officer', 'fights', 'crime'])]]
How do I compute the nGrams of each sentence (by word)? In other words, how do I ensure the nGrams don't span multiple list items? Here is my desired output:
farmer plants
plants grain
fisher catches
catches tuna
police officer
officer fights
fights crime
Upvotes: 0
Views: 3009
Reputation: 5389
You can also consider using scikit-learn's CountVectorizer
as an alternative.
from sklearn.feature_extraction.text import CountVectorizer
sents = list(map(lambda x: ' '.join(x), sentences)) # input is a list of sentences so I map join first
count_vect = CountVectorizer(ngram_range=(2,2)) # bigram
count_vect.fit(sents)
count_vect.vocabulary_
This will give you:
{'catches tuna': 0,
'farmer plants': 1,
'fights crime': 2,
'fisher catches': 3,
'officer fights': 4,
'plants grain': 5,
'police officer': 6}
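If you just want the bigram strings themselves (as in the desired output) rather than the index mapping, one option is to read them back out of the fitted vocabulary. This is a minimal sketch, assuming the count_vect object fitted above; note the order is the vectorizer's (alphabetical) column order, not the original sentence order:
# vocabulary_ maps each bigram string to a column index;
# sorting the keys by that index reproduces the vectorizer's column order
for bigram in sorted(count_vect.vocabulary_, key=count_vect.vocabulary_.get):
    print(bigram)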
Upvotes: 1
Reputation: 50190
Take the ngrams of each sentence separately and combine the results. You probably want to count them, not keep them in one huge collection. Starting with sentences
as a list of lists of words:
import collections
import nltk

counts = collections.Counter()  # or nltk.FreqDist()
for sent in sentences:
    counts.update(nltk.ngrams(sent, 2))
Or, if you prefer a single string rather than a tuple as your key:
for sent in sentences:
    counts.update(" ".join(n) for n in nltk.ngrams(sent, 2))
That's really all there is to it. Then you can see the most common ones, etc.
print(counts.most_common(10))
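If you want the one-bigram-per-line output shown in the question, a small follow-up sketch using the counts built above (with tuple keys) would be:
# each key of counts is a (word1, word2) tuple; join with a space to print it
for bigram in counts:
    print(" ".join(bigram))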
PS. If you really wanted to pile up the bigrams, you'd do it like this. (Your code forms "bigrams" of sentences, not words, because you neglected to write sentences[i].) But skip this step and just count them directly.
all_ngrams = []
for sent in sentences:
    all_ngrams.extend(nltk.ngrams(sent, 2))
Upvotes: 1
Reputation: 122052
Use a list comprehension and chain
to flatten the list:
>>> from itertools import chain
>>> from collections import Counter
>>> from nltk import ngrams
>>> x = [['farmer', 'plants', 'grain'], ['fisher', 'catches', 'tuna'], ['police', 'officer', 'fights', 'crime']]
>>> Counter(chain(*[ngrams(sent,2) for sent in x]))
Counter({('plants', 'grain'): 1, ('police', 'officer'): 1, ('farmer', 'plants'): 1, ('officer', 'fights'): 1, ('fisher', 'catches'): 1, ('fights', 'crime'): 1, ('catches', 'tuna'): 1})
>>> c = Counter(chain(*[ngrams(sent,2) for sent in x]))
Get the keys of the Counter dictionary:
>>> c.keys()
[('plants', 'grain'), ('police', 'officer'), ('farmer', 'plants'), ('officer', 'fights'), ('fisher', 'catches'), ('fights', 'crime'), ('catches', 'tuna')]
>>> [' '.join(b) for b in c.keys()]
['plants grain', 'police officer', 'farmer plants', 'officer fights', 'fisher catches', 'fights crime', 'catches tuna']
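If you only need the joined bigram strings in sentence order (and not the counts), a minimal sketch with a single comprehension over the same x works too:
>>> [' '.join(ng) for sent in x for ng in ngrams(sent, 2)]
['farmer plants', 'plants grain', 'fisher catches', 'catches tuna', 'police officer', 'officer fights', 'fights crime']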
Upvotes: 0