Reputation: 8463
I needed to compute the Unigrams, BiGrams and Trigrams for a text file containing text like:
"Cystic fibrosis affects 30,000 children and young adults in the US alone Inhaling the mists of salt water can reduce the pus and infection that fills the airways of cystic fibrosis sufferers, although side effects include a nasty coughing fit and a harsh taste. That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine."
I started in Python and used the following code:
#!/usr/bin/env python
# File: n-gram.py
def N_Gram(N,text):
NList = [] # start with an empty list
if N> 1:
space = " " * (N-1) # add N - 1 spaces
text = space + text + space # add both in front and back
# append the slices [i:i+N] to NList
for i in range( len(text) - (N - 1) ):
NList.append(text[i:i+N])
return NList # return the list
# test code
for i in range(5):
print N_Gram(i+1,"text")
# more test code
nList = N_Gram(7,"Here is a lot of text to print")
for ngram in iter(nList):
print '"' + ngram + '"'
http://www.daniweb.com/software-development/python/threads/39109/generating-n-grams-from-a-word
But it works for all the n-grams within a word, when I want it from between words as in CYSTIC and FIBROSIS or CYSTIC FIBROSIS. Can someone help me out as to how I can get this done?
Upvotes: 36
Views: 107188
Reputation: 17779
Assuming input is a string contains space separated words, like x = "a b c d"
you can use the following function (edit: see the last function for a possibly more complete solution):
def ngrams(input, n):
input = input.split(' ')
output = []
for i in range(len(input)-n+1):
output.append(input[i:i+n])
return output
ngrams('a b c d', 2) # [['a', 'b'], ['b', 'c'], ['c', 'd']]
If you want those joined back into strings, you might call something like:
[' '.join(x) for x in ngrams('a b c d', 2)] # ['a b', 'b c', 'c d']
Lastly, that doesn't summarize things into totals, so if your input was 'a a a a'
, you need to count them up into a dict:
for g in (' '.join(x) for x in ngrams(input, 2)):
grams.setdefault(g, 0)
grams[g] += 1
Putting that all together into one final function gives:
def ngrams(input, n):
input = input.split(' ')
output = {}
for i in range(len(input)-n+1):
g = ' '.join(input[i:i+n])
output.setdefault(g, 0)
output[g] += 1
return output
ngrams('a a a a', 2) # {'a a': 3}
Upvotes: 41
Reputation: 24788
Using collections.deque
:
from collections import deque
from itertools import islice
def ngrams(message, n=1):
it = iter(message.split())
window = deque(islice(it, n), maxlen=n)
yield tuple(window)
for item in it:
window.append(item)
yield tuple(window)
...or maybe you could do it in one line as a list comprehension:
n = 2
message = "Hello, how are you?".split()
myNgrams = [message[i:i+n] for i in range(len(message) - n + 1)]
Upvotes: 3
Reputation: 1345
If efficiency is an issue and you have to build multiple different n-grams I would consider using the following code (building up on Franck's excellent answer):
from itertools import chain
def n_grams(seq, n=1):
"""Returns an iterator over the n-grams given a list_tokens"""
shift_token = lambda i: (el for j,el in enumerate(seq) if j>=i)
shifted_tokens = (shift_token(i) for i in range(n))
tuple_ngrams = zip(*shifted_tokens)
return tuple_ngrams # if join in generator : (" ".join(i) for i in tuple_ngrams)
def range_ngrams(list_tokens, ngram_range=(1,2)):
"""Returns an itirator over all n-grams for n in range(ngram_range) given a list_tokens."""
return chain(*(n_grams(list_tokens, i) for i in range(*ngram_range)))
Usage :
>>> input_list = input_list = 'test the ngrams generator'.split()
>>> list(range_ngrams(input_list, ngram_range=(1,3)))
[('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]
~Same speed as NLTK:
import nltk
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=5)
# 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
n_grams(input_list,n=5)
# 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=1)
nltk.ngrams(input_list,n=2)
nltk.ngrams(input_list,n=3)
nltk.ngrams(input_list,n=4)
nltk.ngrams(input_list,n=5)
# 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
range_ngrams(input_list, ngram_range=(1,6))
# 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Upvotes: 2
Reputation: 2618
Though the post is old, I thought to mention my answer here so that most of the ngrams creation logic can be in one post.
There is something by name TextBlob in Python. It creates ngrams very easily similar to NLTK.
Below is the code snippet with its output for easy understanding.
sent = """This is to show the usage of Text Blob in Python"""
blob = TextBlob(sent)
unigrams = blob.ngrams(n=1)
bigrams = blob.ngrams(n=2)
trigrams = blob.ngrams(n=3)
And the output is :
unigrams
[WordList(['This']),
WordList(['is']),
WordList(['to']),
WordList(['show']),
WordList(['the']),
WordList(['usage']),
WordList(['of']),
WordList(['Text']),
WordList(['Blob']),
WordList(['in']),
WordList(['Python'])]
bigrams
[WordList(['This', 'is']),
WordList(['is', 'to']),
WordList(['to', 'show']),
WordList(['show', 'the']),
WordList(['the', 'usage']),
WordList(['usage', 'of']),
WordList(['of', 'Text']),
WordList(['Text', 'Blob']),
WordList(['Blob', 'in']),
WordList(['in', 'Python'])]
trigrams
[WordList(['This', 'is', 'to']),
WordList(['is', 'to', 'show']),
WordList(['to', 'show', 'the']),
WordList(['show', 'the', 'usage']),
WordList(['the', 'usage', 'of']),
WordList(['usage', 'of', 'Text']),
WordList(['of', 'Text', 'Blob']),
WordList(['Text', 'Blob', 'in']),
WordList(['Blob', 'in', 'Python'])]
As simple as that.
There is more to this that are being done by TextBlob. Please have a look at this doc for more details - https://textblob.readthedocs.io/en/dev/
Upvotes: 2
Reputation: 117
nltk has native support for ngrams
'n' is the ngram size ex: n=3 is for a trigram
from nltk import ngrams
def ngramize(texts, n):
output=[]
for text in texts:
output += ngrams(text,n)
return output
Upvotes: 2
Reputation: 83167
A short Pythonesque solution from this blog:
def find_ngrams(input_list, n):
return zip(*[input_list[i:] for i in range(n)])
Usage:
>>> input_list = ['all', 'this', 'happened', 'more', 'or', 'less']
>>> find_ngrams(input_list, 1)
[('all',), ('this',), ('happened',), ('more',), ('or',), ('less',)]
>>> find_ngrams(input_list, 2)
[('all', 'this'), ('this', 'happened'), ('happened', 'more'), ('more', 'or'), ('or', 'less')]
>>> find_ngrams(input_list, 3))
[('all', 'this', 'happened'), ('this', 'happened', 'more'), ('happened', 'more', 'or'), ('more', 'or', 'less')]
Upvotes: 53
Reputation: 2805
There is one more interesting module into python called Scikit. Here is the code. This will help u to get all the grams given in a particular range. Here is the code
from sklearn.feature_extraction.text import CountVectorizer
text = "this is a foo bar sentences and i want to ngramize it"
vectorizer = CountVectorizer(ngram_range=(1,6))
analyzer = vectorizer.build_analyzer()
print analyzer(text)
Output is
[u'this', u'is', u'foo', u'bar', u'sentences', u'and', u'want', u'to', u'ngramize', u'it', u'this is', u'is foo', u'foo bar', u'bar sentences', u'sentences and', u'and want', u'want to', u'to ngramize', u'ngramize it', u'this is foo', u'is foo bar', u'foo bar sentences', u'bar sentences and', u'sentences and want', u'and want to', u'want to ngramize', u'to ngramize it', u'this is foo bar', u'is foo bar sentences', u'foo bar sentences and', u'bar sentences and want', u'sentences and want to', u'and want to ngramize', u'want to ngramize it', u'this is foo bar sentences', u'is foo bar sentences and', u'foo bar sentences and want', u'bar sentences and want to', u'sentences and want to ngramize', u'and want to ngramize it', u'this is foo bar sentences and', u'is foo bar sentences and want', u'foo bar sentences and want to', u'bar sentences and want to ngramize', u'sentences and want to ngramize it']
Here it gives all the grams given in a range 1 to 6. Its using the method called countVectorizer. Here is the link for that.
Upvotes: 11
Reputation: 6985
Use NLTK (the Natural Language Toolkit) and use the functions to tokenize (split) your text into a list and then find bigrams and trigrams.
import nltk
words = nltk.word_tokenize(my_text)
my_bigrams = nltk.bigrams(words)
my_trigrams = nltk.trigrams(words)
Upvotes: 27