Reputation: 609
I'm trying to build an index of words by reading each line of a text file and checking whether each word appears in that line. If it does, the function prints the line number and carries on checking. Printing each word with its line number works the way I wanted, but I'm not sure what data structure to use to store the line numbers.
Code example:
def index(filename, wordList):
    'string, list(string) ==> string & int, returns an index of words with the line number\
    each word occurs in'
    indexDict = {}
    res = []
    infile = open(filename, 'r')
    count = 0
    line = infile.readline()
    while line != '':
        count += 1
        for word in wordList:
            if word in line:
                #indexDict[word] = [count]
                print(word, count)
        line = infile.readline()
    #return indexDict
This prints the word and whatever the count is at the time (line number), but what I'm trying to do is store the numbers so that later on I can make it print out
word linenumber
word2 linenumber, linenumber
And so on. I felt a dictionary would work for this if I put each line number inside a list so each key can contain more than one value, but the closest I got was this:
{'mortal': [30], 'dying': [9], 'ghastly': [82], 'ghost': [9], 'raven': [120], 'evil': [106], 'demon': [122]}
When I wanted it to show up as:
{'mortal': [30], 'dying': [9], 'ghastly': [82], 'ghost': [9], 'raven': [44, 53, 55, 64, 78, 97, 104, 111, 118, 120], 'evil': [99, 106], 'demon': [122]}
Any ideas?
Upvotes: 0
Views: 11241
Reputation: 1125368
You need to append your next item to the list, if the list already exists.
The easiest way to have the list already be there, even the first time you find a word, is to use the collections.defaultdict class to track your word-to-lines mapping:
from collections import defaultdict

def index(filename, wordList):
    indexDict = defaultdict(list)
    with open(filename, 'r') as infile:
        # start=1 so the line numbers match the 1-based count in the question
        for i, line in enumerate(infile, start=1):
            for word in wordList:
                if word in line:
                    indexDict[word].append(i)
                    print(word, i)
    return indexDict
I've simplified your code a little using best practices: opening the file as a context manager so it closes automatically when done, and using enumerate() to create line numbers on the fly.
You could speed this up a little further still (and make it more accurate, since word in line also matches substrings such as 'evil' inside 'devil') if you turned each line into a set of words (set(line.split()) perhaps, but that won't remove punctuation). You could then use set intersection tests against wordList (also converted to a set), which could be considerably faster at finding matching words.
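The set-intersection idea above could look roughly like the sketch below. The function name index_by_word_set is made up for illustration, and the punctuation stripping and lowercasing are my own assumptions rather than part of the answer; it also takes any iterable of lines (an open file works) instead of a filename.

```python
import string
from collections import defaultdict

def index_by_word_set(lines, word_list):
    # Assumption: matching is whole-word and case-insensitive.
    wanted = {word.lower() for word in word_list}
    index = defaultdict(list)
    for lineno, line in enumerate(lines, start=1):
        # Strip leading/trailing punctuation from each whitespace-separated token.
        tokens = {tok.strip(string.punctuation).lower() for tok in line.split()}
        # Only words present in both sets are recorded for this line.
        for word in wanted & tokens:
            index[word].append(lineno)
    return index
```

Because each line is split into whole tokens, this gives different results from word in line whenever a word only occurs as part of a longer word.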
Upvotes: 1
Reputation: 6801
You are replacing the old value with this line:
indexDict[word] = [count]
Changing it to
indexDict[word] = indexDict.setdefault(word, []) + [count]
will yield the answer you want. It gets the current value of indexDict[word] and adds the new count to it; if there is no indexDict[word] yet, it creates a new empty list and adds count to that.
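A closely related variant, sketched here under the assumption that you are happy to modify the list in place, is to call append on whatever setdefault returns. That avoids building a new list on every match. The function name index_lines is hypothetical, and it takes an iterable of lines rather than a filename.

```python
def index_lines(lines, word_list):
    index = {}
    for count, line in enumerate(lines, start=1):
        for word in word_list:
            if word in line:
                # setdefault returns the existing list, or stores and
                # returns a new empty one the first time the word is seen.
                index.setdefault(word, []).append(count)
    return index
```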
Upvotes: 2
Reputation:
There is probably a more pythonic way to write this, but just for readability you could try this (a simple example):
# named groups and items so the built-in dict and list are not shadowed
groups = {1: [], 2: [], 3: []}
items = [1, 2, 2, 2, 3, 3]
for k in groups.keys():
    for i in items:
        if i == k:
            groups[k].append(i)
In [7]: groups
Out[7]: {1: [1], 2: [2, 2, 2], 3: [3, 3]}
Upvotes: 2
Reputation: 82949
Try something like this:
import collections

def index(filename, wordList):
    indexDict = collections.defaultdict(list)
    with open(filename) as infile:
        for (i, line) in enumerate(infile.readlines()):
            for word in wordList:
                if word in line:
                    indexDict[word].append(i+1)
    return indexDict
This yields the exact same results as in your example (using Poe's Raven).
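To print the returned mapping in the "word linenumber, linenumber" layout from the question, something along these lines might do. The helper name format_index and the alphabetical ordering are assumptions for illustration only.

```python
def format_index(index_dict):
    lines = []
    for word in sorted(index_dict):
        # Join the line numbers with ", " as in the question's desired output.
        numbers = ", ".join(str(n) for n in index_dict[word])
        lines.append(f"{word} {numbers}")
    return lines

for entry in format_index({"raven": [44, 53], "evil": [99, 106], "demon": [122]}):
    print(entry)
```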
Alternatively, you might consider using a normal dict instead of a defaultdict and initializing it with all the words in the list, to make sure that indexDict contains an entry even for words that are not in the text.
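A minimal sketch of that alternative, assuming an iterable of lines instead of a filename (the function name index_all_words is hypothetical):

```python
def index_all_words(lines, word_list):
    # Every requested word gets an entry up front, so words that never
    # occur still appear in the result with an empty list.
    index = {word: [] for word in word_list}
    for lineno, line in enumerate(lines, start=1):
        for word in word_list:
            if word in line:
                index[word].append(lineno)
    return index
```

With this version, a word that never occurs maps to an empty list rather than being absent from the result.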
Also, note the use of enumerate. This builtin function is very useful for iterating over both the index and the item at that index of some list (like the lines in the file).
Upvotes: 3