Reputation: 165
I have implemented the following algorithm to extract groups of n words from a string.
def ngrams(text, size):
    tokens = text.split()
    ngrams = []
    for i in range(len(tokens)):
        if (len(tokens) - i) < size:
            break
        list_shingle = tokens[i:i+size]
        str_shingle = ' '.join(list_shingle)
        ngrams.append(str_shingle)
    return ngrams
The strings have the following form:
['Hello my name is Inigo Montoya, you killed my father, prepare to die.']
The output should look like this, for a size of 3 words:
['Hello my name','my name is','name is Inigo',...,'prepare to die.']
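For reference, calling the function on the example sentence reproduces this (the full result also contains the overlapping shingles elided by the ellipsis):

    text = 'Hello my name is Inigo Montoya, you killed my father, prepare to die.'
    print(ngrams(text, 3))
    # ['Hello my name', 'my name is', 'name is Inigo', ..., 'prepare to die.']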
I have to compare a large number of documents; how can I make this code more efficient?
Upvotes: 1
Views: 1684
Reputation: 241721
This isn't a different algorithm, but it uses a list comprehension instead of a loop:
def ngrams2(text, size):
    tokens = text.split()
    return [' '.join(tokens[i:i+size])
            for i in range(len(tokens) - size + 1)]
At least on my machine, the change pays off:
$ python3 -m timeit -s 'from shingle import ngrams, ngrams2, text' 'ngrams(text, 3)'
1000 loops, best of 3: 501 usec per loop
$ python3 -m timeit -s 'from shingle import ngrams, ngrams2, text' 'ngrams2(text, 3)'
1000 loops, best of 3: 293 usec per loop
(text is 10 kB generated by https://www.lipsum.com/. It contains 1347 "words" in a single line.)
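If memory is a concern when you process a large number of documents, a generator variant is a natural next step (a sketch along the same lines; I haven't benchmarked it):

    def ngrams_iter(text, size):
        tokens = text.split()
        # Yield each shingle lazily instead of building the whole list in memory.
        for i in range(len(tokens) - size + 1):
            yield ' '.join(tokens[i:i + size])

You can still get the full list with list(ngrams_iter(text, 3)) when you need one.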
Upvotes: 3