Reputation: 339
First, this is homework, so I would just like suggestions, please. I am writing a program that generates a weighted inverted index. The weighted inverted index is a dictionary with a word as a key; the value is a list of lists, with each item in the list containing the document number, and the number of times that word appears in the document.
For example,
{"a": [[1, 2],[2,1]]}
The word "a" appears twice in document 1 and once in document 2.
I am practicing with two small files.
file1.txt:
Where should I go
When I want to have
A smoke,
A pancake,
and a nap.
file2.txt:
I do not know
Where my pancake is
I want to take a nap.
Here is my program code:
def cleanData(myFile):
    file = open(myFile, "r")
    data = file.read()
    wordList = []
    #All numbers and end-of-sentence punctuation
    #replaced with the empty string
    #No replacement of apostrophes
    formattedData = data.strip().lower().replace(",","")\
                    .replace(".","").replace("!","").replace("?","")\
                    .replace(";","").replace(":","").replace('"',"")\
                    .replace("1","").replace("2","").replace("3","")\
                    .replace("4","").replace("5","").replace("6","")\
                    .replace("7","").replace("8","").replace("9","")\
                    .replace("0","")
    words = formattedData.split() #creates a list of all words in the document
    for word in words:
        wordList.append(word) #adds each word in a document to the word list
    return wordList
def main():
    fullDict = {}
    files = ["file1.txt", "file2.txt"]
    docNumber = 1
    for file in files:
        wordList = cleanData(file)
        for word in wordList:
            if word not in fullDict:
                fullDict[word] = []
                fileList = [docNumber, 1]
                fullDict[word].append(fileList)
            else:
                listOfValues = list(fullDict.values())
                for x in range(len(listOfValues)):
                    if docNumber == listOfValues[x][0]:
                        listOfValues[x][1] +=1
                        fullDict[word] = listOfValues
                        break
                fileList = [docNumber,1]
                fullDict[word].append(fileList)
        docNumber +=1
    return fullDict
What I am trying to do is generate something like this:
{"a": [[1,3],[2,1]], "nap": [[1,1],[2,1]]}
What I am getting is this:
{"a": [[1,1],[1,1],[1,1],[2,1]], "nap": [[1,1],[2,1]]}
It records all occurrences of each word in all documents, but it records repeats separately. I cannot figure this out. Any help would be appreciated! Thank you in advance. :)
Upvotes: 3
Views: 827
Reputation: 239493
There are two main problems in your code.
Problem 1
listOfValues = list(fullDict.values())
for x in range(len(listOfValues)):
    if docNumber == listOfValues[x][0]:
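Here, you take all the values of the dictionary, irrespective of the current word, and try to increment the count in them. Each of those values is itself a list of [docNumber, count] pairs, so the comparison with docNumber never succeeds. You should be incrementing the count in the list that corresponds to the current word. So, you should change it to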
listOfValues = fullDict[word]
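To make the difference concrete, here is a minimal sketch using a hand-built dictionary (the sample contents are made up for illustration). `fullDict.values()` yields one list per word, so `listOfValues[x][0]` is a `[docNumber, count]` pair rather than a document number:

```python
# Hypothetical sample index, shaped like the one in the question.
fullDict = {"a": [[1, 1]], "nap": [[1, 1]]}
docNumber = 1

# All values, regardless of word: a list of per-word lists.
allValues = list(fullDict.values())
first = allValues[0][0]                  # a [docNumber, count] pair
never_matches = (docNumber == first)     # int compared with a list

# Only the entries for the current word.
postings = fullDict["a"]
matches = (docNumber == postings[0][0])  # int compared with an int

print(first, never_matches, matches)
```

With `fullDict[word]`, the first element of each entry is the document number, so the comparison can succeed.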
Problem 2
fileList = [docNumber,1]
fullDict[word].append(fileList)
Apart from incrementing the count, you always add a new value to fullDict[word]. But you should add it only if the docNumber is not already present in listOfValues. So, you can use an else with the for loop, like this
for word in wordList:
    if word not in fullDict:
        ....
    else:
        listOfValues = fullDict[word]
        for x in range(len(listOfValues)):
            ....
        else:
            fileList = [docNumber, 1]
            fullDict[word].append(fileList)
After making these two changes, I got the following output
{'a': [[1, 3], [2, 1]],
'and': [[1, 1]],
'do': [[2, 1]],
'go': [[1, 1]],
'have': [[1, 1]],
'i': [[1, 2], [2, 2]],
'is': [[2, 1]],
'know': [[2, 1]],
'my': [[2, 1]],
'nap': [[1, 1], [2, 1]],
'not': [[2, 1]],
'pancake': [[1, 1], [2, 1]],
'should': [[1, 1]],
'smoke': [[1, 1]],
'take': [[2, 1]],
'to': [[1, 1], [2, 1]],
'want': [[1, 1], [2, 1]],
'when': [[1, 1]],
'where': [[1, 1], [2, 1]]}
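The `else` attached to a `for` loop runs only when the loop finishes without hitting `break`. A small sketch of that behavior, using a hypothetical helper `add_occurrence` that is not part of the original code:

```python
def add_occurrence(postings, docNumber):
    """Increment the count for docNumber, or add a new [docNumber, 1] pair."""
    for pair in postings:
        if pair[0] == docNumber:
            pair[1] += 1
            break              # found: the else clause is skipped
    else:
        # Reached only if the loop never hit break.
        postings.append([docNumber, 1])
    return postings

postings = [[1, 1]]
add_occurrence(postings, 1)    # document 1 is already present
add_occurrence(postings, 2)    # document 2 is new
print(postings)
```

After the two calls, the entry for document 1 has been incremented and a new entry for document 2 has been appended.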
Here are a few suggestions to improve your code.
Instead of using lists to store the document number and the count, you can use a dictionary. That would make your life easier.
Instead of counting manually, you can use collections.Counter.
Instead of using multiple replaces, you can use a simple regular expression, like this
formattedData = re.sub(r'[,.!?;:"0-9]', '', data.strip().lower())
If I were to clean up cleanData, I would do it like this
import re

def cleanData(myFile):
    with open(myFile, "r") as input_file:
        data = input_file.read()
    # The comma is in the character class because the original replace chain removes it too
    return re.sub(r'[,.!?;:"0-9]', '', data.strip().lower()).split()
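As a rough check, the regular-expression approach can be compared with character-by-character `replace` calls on a sample string. This sketch assumes the comma belongs in the character class, since the question's code removes commas as well; `clean_text` and `clean_text_chained` are illustrative names, and both work on strings rather than files:

```python
import re

def clean_text(data):
    """Lowercase, drop punctuation and digits, and split into words."""
    return re.sub(r'[,.!?;:"0-9]', '', data.strip().lower()).split()

def clean_text_chained(data):
    """The same cleaning written as repeated str.replace calls."""
    text = data.strip().lower()
    for ch in ',.!?;:"0123456789':
        text = text.replace(ch, "")
    return text.split()

sample = "A smoke,\nA pancake,\nand a nap."
print(clean_text(sample))
print(clean_text(sample) == clean_text_chained(sample))
```

If the comma were left out of the class, words such as "smoke," would keep their trailing comma and be counted separately from "smoke".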
In the main loop, you can use the improvements suggested by Brad Budlong, like this
def main():
    fullDict = {}
    files = ["file1.txt", "file2.txt"]
    for docNumber, currentFile in enumerate(files, 1):
        for word in cleanData(currentFile):
            if word not in fullDict:
                fullDict[word] = [[docNumber, 1]]
            else:
                for x in fullDict[word]:
                    if docNumber == x[0]:
                        x[1] += 1
                        break
                else:
                    fullDict[word].append([docNumber, 1])
    return fullDict
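The first two suggestions (a dictionary per word and `collections.Counter`) could be combined roughly as follows. This is a sketch rather than a drop-in replacement: it works on in-memory strings instead of files, the names `build_index` and `documents` are made up for illustration, and the result maps each word to a `{docNumber: count}` dictionary instead of a list of lists.

```python
import re
from collections import Counter

def build_index(documents):
    """Map each word to {docNumber: count}; documents is a list of strings."""
    index = {}
    for docNumber, text in enumerate(documents, 1):
        # Same cleaning idea as cleanData, with the comma included (assumption)
        words = re.sub(r'[,.!?;:"0-9]', '', text.lower()).split()
        # Counter tallies every word of one document in a single pass
        for word, count in Counter(words).items():
            index.setdefault(word, {})[docNumber] = count
    return index

documents = [
    "Where should I go\nWhen I want to have\nA smoke,\nA pancake,\nand a nap.",
    "I do not know\nWhere my pancake is\nI want to take a nap.",
]
index = build_index(documents)
print(index["a"])
```

If the list-of-lists form from the question is required, each entry can be converted with `[[doc, n] for doc, n in sorted(index[word].items())]`.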
Upvotes: 2
Reputation: 1785
My preferred implementation of the for loops doesn't iterate using the len and range functions. Since these are all mutable lists, you don't need to know the index; you just need each of the inner lists, which can then be modified in place. I replaced the for loop with the following and got the same output as thefourtheye.
for word in wordList:
    if word not in fullDict:
        fullDict[word] = [[docNumber, 1]]
    else:
        for val in fullDict[word]:
            if val[0] == docNumber:
                val[1] += 1
                break
        else:
            fullDict[word].append([docNumber, 1])
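The reason no index is needed is that the loop variable refers to the same inner list object that is stored in the outer list, so changing it in place changes the stored value. A minimal illustration with made-up data:

```python
postings = [[1, 1], [2, 1]]   # sample [docNumber, count] pairs

for val in postings:
    if val[0] == 2:
        val[1] += 1           # mutates the inner list in place
        break

after_mutation = [pair[:] for pair in postings]   # snapshot for comparison

# Rebinding the loop variable, by contrast, leaves the stored lists untouched.
for val in postings:
    val = [0, 0]

print(after_mutation, postings)
```

Only in-place changes such as `val[1] += 1` reach the stored data; assigning a new list to `val` does not.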
Upvotes: 1