GGS

Reputation: 165

Efficient shingling algorithm

I have implemented the following algorithm to extract groups of n words (shingles) from a string.

def ngrams(text, size):
    tokens = text.split()
    shingles = []
    for i in range(len(tokens)):
        # stop once fewer than `size` tokens remain
        if (len(tokens) - i) < size:
            break
        list_shingle = tokens[i:i+size]
        str_shingle = ' '.join(list_shingle)
        shingles.append(str_shingle)
    return shingles

The strings have the following form:

['Hello my name is Inigo Montoya, you killed my father, prepare to die.']

For a size of 3 words, the output should look like this:

['Hello my name','my name is','name is Inigo',...,'prepare to die.']
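
For reference, a quick check of the intended behaviour (using the function above):

text = 'Hello my name is Inigo Montoya, you killed my father, prepare to die.'
print(ngrams(text, 3))
# ['Hello my name', 'my name is', 'name is Inigo', ..., 'prepare to die.']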

I have to compare a large number of documents; how can I make this code more efficient?

Upvotes: 1

Views: 1684

Answers (1)

rici

Reputation: 241721

This isn't a different algorithm, but it uses a list comprehension instead of a loop:

def ngrams2(text, size):
    tokens = text.split()
    return [' '.join(tokens[i:i+size])
                     for i in range(len(tokens) - size + 1)]

At least on my machine, the change pays off:

$ python3 -m timeit -s 'from shingle import ngrams, ngrams2, text' 'ngrams(text, 3)'
1000 loops, best of 3: 501 usec per loop
$ python3 -m timeit -s 'from shingle import ngrams, ngrams2, text' 'ngrams2(text, 3)'
1000 loops, best of 3: 293 usec per loop

(text is 10 kB generated by https://www.lipsum.com/. It contains 1347 "words" in a single line.)
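
If you want to experiment further, a zip-based variant is another option worth timing; this is only a sketch under the same assumptions (whitespace tokenization, space-joined shingles), and I haven't benchmarked it:

def ngrams3(text, size):
    tokens = text.split()
    # zip `size` staggered views of the token list; each tuple is one shingle
    return [' '.join(gram)
            for gram in zip(*(tokens[i:] for i in range(size)))]

It avoids computing the slice tokens[i:i+size] once per shingle, at the cost of materializing size shifted copies of the token list up front, so the tradeoff is worth measuring with timeit as above.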

Upvotes: 3
