iKyriaki

Reputation: 609

Creating an index of words

I'm trying to create an index of words by reading each line of a text file and checking whether each word occurs in that line. If it does, the code prints the line number and continues checking. Printing each word and line number works the way I want, but I'm not sure what data structure I could use to store the line numbers.

Code example:

def index(filename, wordList):
    'string, list(string) ==> string & int, returns an index of words with the line number\
    each word occurs in'
    indexDict = {}
    res = []
    infile = open(filename, 'r')
    count = 0
    line = infile.readline()
    while line != '':
        count += 1
        for word in wordList:
            if word in line:
                #indexDict[word] = [count]
                print(word, count)
        line = infile.readline()
    #return indexDict

This prints the word and whatever the count is at the time (line number), but what I'm trying to do is store the numbers so that later on I can make it print out

word linenumber

word2 linenumber, linenumber

And so on. I felt a dictionary would work for this if I put each line number inside a list so each key can contain more than one value, but the closest I got was this:

{'mortal': [30], 'dying': [9], 'ghastly': [82], 'ghost': [9], 'raven': [120], 'evil': [106], 'demon': [122]}

When I wanted it to show up as:

{'mortal': [30], 'dying': [9], 'ghastly': [82], 'ghost': [9], 'raven': [44, 53, 55, 64, 78, 97, 104, 111, 118, 120], 'evil': [99, 106], 'demon': [122]}

Any ideas?

Upvotes: 0

Views: 11241

Answers (4)

Martijn Pieters

Reputation: 1125368

You need to append your next item to the list, if the list already exists.

The easiest way to have the list already be there even for the first time you find a word, is to use the collections.defaultdict class to track your word-to-lines mapping:

from collections import defaultdict

def index(filename, wordList):
    indexDict = defaultdict(list)
    with open(filename, 'r') as infile:
        for i, line in enumerate(infile, 1):  # start at 1 so i matches the question's line count
            for word in wordList:
                if word in line:
                    indexDict[word].append(i)
                    print(word, i)

    return indexDict

I've simplified your code a little using best practices: opening the file as a context manager so it is closed automatically when done, and using enumerate() to generate line numbers on the fly.

You could speed this up a little further still (and make it more accurate) if you turned each line into a set of words (set(line.split()), perhaps, although that won't remove punctuation). You could then use set intersection against wordList (also converted to a set), which could be considerably faster at finding matching words, and which matches whole words only, whereas `word in line` is a substring test.
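A minimal sketch of that set-intersection idea, under a few assumptions that are not in the original code: the function takes any iterable of lines (an open file works), matching is case-insensitive, and ASCII punctuation is stripped with str.translate. The names index_words and _STRIP_PUNCT are made up for illustration.

```python
import string
from collections import defaultdict

# Translation table that deletes ASCII punctuation (assumption: the text is plain English).
_STRIP_PUNCT = str.maketrans('', '', string.punctuation)

def index_words(lines, word_list):
    """Map each wanted word to the 1-based numbers of the lines containing it as a whole word."""
    wanted = {w.lower() for w in word_list}
    index = defaultdict(list)
    for lineno, line in enumerate(lines, 1):
        # Lowercase, drop punctuation, then split on whitespace to get the line's words.
        words_in_line = set(line.lower().translate(_STRIP_PUNCT).split())
        # Only words present in both sets are recorded for this line.
        for word in wanted & words_in_line:
            index[word].append(lineno)
    return index

sample = ['Quoth the Raven, "Nevermore."',
          'Prophet! said I, thing of evil!',
          'And the raven, never flitting']
result = index_words(sample, ['raven', 'evil', 'demon'])
```

Note that deleting punctuation also removes apostrophes, so "raven's" becomes "ravens" and would not match "raven"; whether that is acceptable depends on the text being indexed.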

Upvotes: 1

octref

Reputation: 6801

You are replacing the old value with this line:

indexDict[word] = [count]

Changing it to

indexDict[word] = indexDict.setdefault(word, []) + [count]

will yield the result you want. It gets the current value of indexDict[word] and appends the new count to it; if there is no indexDict[word] yet, it creates a new empty list and appends count to that.
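As a variation on the same idea, setdefault() returns the list stored in the dictionary, so you can append to it in place instead of building a new list with + on every hit. A small sketch with made-up sample data (the word/line pairs below are only for illustration):

```python
index_dict = {}

# Hypothetical (word, line number) hits, in the order they would be found.
hits = [('raven', 44), ('evil', 99), ('raven', 53)]

for word, count in hits:
    # setdefault returns the existing list, or inserts and returns a new empty one.
    index_dict.setdefault(word, []).append(count)
```

After the loop, index_dict maps 'raven' to [44, 53] and 'evil' to [99].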

Upvotes: 2

user1795784

Reputation:

There is probably a more pythonic way to write this, but just for readability you could try this (a simple example):

groups = {1: [], 2: [], 3: []}  # not named "dict", which would shadow the builtin

values = [1,2,2,2,3,3]  # likewise not named "list"

for k in groups:
    for i in values:
        if i == k:
            groups[k].append(i)


In [7]: groups
Out[7]: {1: [1], 2: [2, 2, 2], 3: [3, 3]}

Upvotes: 2

tobias_k

Reputation: 82949

Try something like this:

import collections
def index(filename, wordList):
    indexDict = collections.defaultdict(list)
    with open(filename) as infile:
        for i, line in enumerate(infile, 1):  # iterate lazily, numbering lines from 1
            for word in wordList:
                if word in line:
                    indexDict[word].append(i)
    return indexDict

This yields the exact same results as in your example (using Poe's Raven).

Alternatively, you might consider using a normal dict instead of a defaultdict and initializing it with all the words in the list. That way indexDict contains an entry (an empty list) even for words that do not occur in the text.
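A rough sketch of that plain-dict variant, assuming the lines come from any iterable (an open file works) and adding a small formatter for the "word linenumber, linenumber" output asked for in the question. The names index_words_plain and format_index are hypothetical.

```python
def index_words_plain(lines, word_list):
    """Like index(), but every word gets an entry, even with no matches."""
    index = {word: [] for word in word_list}
    for lineno, line in enumerate(lines, 1):
        for word in word_list:
            if word in line:  # substring test, as in the original code
                index[word].append(lineno)
    return index

def format_index(index):
    """Render one 'word n1, n2, ...' line per word."""
    return '\n'.join(
        f"{word} {', '.join(str(n) for n in numbers)}"
        for word, numbers in index.items()
    )

sample = ['a raven sat', 'evil raven', 'nothing here']
plain = index_words_plain(sample, ['raven', 'evil', 'demon'])
report = format_index(plain)
```

Here 'demon' still appears in the result with an empty list, which is the point of pre-filling the dictionary.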

Also, note the use of enumerate. This builtin function is very useful for iterating over both the index and the item at that index of some list (like the lines in the file).

Upvotes: 3
