franklin

Reputation: 1819

My inverted index is very slow; any suggestions?

I have a text file where each line contains some text (the actual file has no line numbers), like this:

line#:     text:
0          This is some text
1          More text
2          whats for lunch

I want a function that returns a dictionary mapping each word to the set of line numbers it occurs on, essentially an inverted index.

i.e. {'This': {0}, 'text': {0, 1}, 'for': {2}, ... }

After scanning the text file (this takes 0.18 seconds), I put the lines into a list of lists, so that each position in the list stores the split line, i.e.:

[['This', 'is', 'some', 'text'], ['More', ...] ...]

After that I use enumerate() to extract the position and create the dictionary. I have a solution already, but it's so ugly and took me so long that I want to see a more elegant one.

For reference, my algorithm runs for 882.28 seconds, i.e. almost 15 minutes, on 1099 lines and 753210 words. In other words, decidedly non-Pythonic.

import time

def invidx(strlist):
    # time the algorithm for benchmarking
    start = time.time()  

    f = open(strlist, 'r')
    wordLoc = []
    for line in f:    
        s = line.split()        # str.split() already returns a list
        wordLoc.append(s)
    f.close()

    # benchmark
    print 'job completed in %.2fs' % (time.time() - start) 

    try:
        q = {}
        for a, b in enumerate(wordLoc):
            l = set()
            for w in b :
                if w not in q:
                    l = {a for a, b in enumerate(wordLoc) if w in b}
                    q[w] = l
    except KeyboardInterrupt:
        print 'Interrupt detected: aborting...'
        print 'Failed to complete indexing, ran for %.2fs' % \
            (time.time() - start)
        exit(0)                  

    return q

EDIT:

As per request, the code is above. Go easy on me, guys.

Upvotes: 1

Views: 223

Answers (4)

martineau

Reputation: 123463

This seems to work and I'm fairly sure it's faster than your version:

from time import time

def invidx(strlist):
    # time the algorithm for benchmarking
    start = time()

    wordLocs = []
    unique_words = set()
    with open(strlist, 'r') as f:
        for line in f:
            words = line.split()
            unique_words.update(words)
            wordLocs.append(set(words))

    # benchmark
    print 'job completed in %.2fs' % (time() - start)

    try:
        q = {}
        for unique_word in unique_words:
            occurrences = set()
            for line, words in enumerate(wordLocs):
                if unique_word in words:
                    occurrences.add(line)
            q[unique_word] = occurrences

    except KeyboardInterrupt:
        print ('Interrupt detected: aborting...\n' +
               'Failed to complete indexing, ran for %.2fs' % (time() - start))
        exit(0)

    return q

from pprint import pprint
pprint(invidx('strlist.txt'))

Output from your trivial test file:

job completed in 0.00s
{'More': set([1]),
 'This': set([0]),
 'for': set([2]),
 'is': set([0]),
 'lunch': set([2]),
 'some': set([0]),
 'text': set([0, 1]),
 'whats': set([2])}

Upvotes: 0

Ashwini Chaudhary

Reputation: 250951

You can use collections.defaultdict:

from collections import defaultdict

dic = defaultdict(set)
with open('abc') as f:
    for i, line in enumerate(f):  # enumerate yields the line number along with the line
        words = line.split()      # split the line using str.split()
        for word in words:        # iterate over the words and add each to its corresponding set
            dic[word.lower()].add(i)
print dic

output:

defaultdict(<type 'set'>,
{'whats': set([2]),
 'for': set([2]),
 'this': set([0]),
 'text': set([0, 1]),
 'is': set([0]),
 'some': set([0]),
 'lunch': set([2]),
 'more': set([1])})

Upvotes: 1

Craig Gidney

Reputation: 18276

The line responsible for the slowdown is this one:

l = {a for a, b in enumerate(wordLoc) if w in b}

Every time you find a word you haven't seen yet, you re-enumerate every single line and check whether it contains the word. This contributes O(NumberOfUniqueWords * NumberOfLines) operations overall, which is quadratic-ish in the size of the input.
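To put numbers on that (a rough estimate from the question's figures, since the vocabulary size isn't given): 753210 words over 1099 lines averages about 685 words per line, so each run of that comprehension performs 1099 list-membership tests costing roughly 750,000 element comparisons in total. A vocabulary of even a few thousand unique words pushes the overall work into the billions of comparisons, consistent with the reported 15-minute runtime.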

You're already enumerating every word of every line. Why not just add them up as you go?

for w in b:
    if w not in q: q[w] = []
    q[w].append(a)

This should take O(NumberOfWords) time, which is linear in the size of the input instead of quadratic-ish. You touch each thing once, instead of once per unique word.
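For completeness, here's a minimal sketch of the whole function with that fix applied (an illustration, not the poster's exact code; it keeps the question's set-of-line-numbers output rather than lists):

import time

def invidx(strlist):
    # single pass: build the index while reading the file
    start = time.time()
    q = {}
    with open(strlist, 'r') as f:
        for lineno, line in enumerate(f):   # lineno is the 0-based line number
            for w in line.split():
                if w not in q:              # first time we see this word
                    q[w] = set()
                q[w].add(lineno)            # record the occurrence as we go
    print 'job completed in %.2fs' % (time.time() - start)
    return q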

Upvotes: 2

shx2

Reputation: 64318

You can get the line numbers using enumerate when you scan the file initially, and add line numbers to the dict of sets as you go.

myfile.txt:

a b c
b x y
a c b

index it:

index = {}
with open('myfile.txt') as F:
    for line_num, line in enumerate(F):
        for word in line.split():
            index.setdefault(word, set()).add(line_num)

index
=> {'a': set([0, 2]),
 'b': set([0, 1, 2]),
 'c': set([0, 2]),
 'x': set([1]),
 'y': set([1])}
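
Once built, a lookup is a single dict access. For example, with the index above:

index.get('b', set())    # lines containing 'b'
=> set([0, 1, 2])
index.get('zzz', set())  # unknown words yield an empty set
=> set([])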

Upvotes: 3
