Reputation: 77
When I run the following code I get a KeyError on line 12:
import math
from collections import Counter

def retrieve():
    wordFrequency = {'bit':{1:3,2:4,3:19,4:0},'red':{1:0,2:0,3:15,4:0},'dog':{1:3,2:0,3:4,4:5}}
    search = {'bit':1,'dog':3,'shoe':5}
    sizeFileVec = {}
    for word, innerDict in wordFrequency.iteritems():
        for fileNum, appearances in innerDict.iteritems():
            sizeFileVec[fileNum] += appearances ** 2
    for fileNum in sizeFileVec:
        sizeFileVec[fileNum] = math.sqrt(sizeFileVec[fileNum])
    results = []
    for word, occurrences in search.iteritems():
        file_relevancy = Counter()
        for fileNum, appear_in_file in wordFrequency.get(word, {}).iteritems():
            file_relevancy[fileNum] += (occurrences * appear_in_file) / sizeFileVec[fileNum]
    results = [fileNum for (fileNum, count) in file_relevancy.most_common()]
    return results

print retrieve()
The part that raises the error is supposed to go through the inner dictionaries of wordFrequency, sum the squares of the values for each file number, and then take the square root of each sum (there are 4 files). For example, for file 1 it is sqrt(3^2 + 0^2 + 3^2).
results is supposed to be a list of the 4 files, ordered from most to least relevant to the query. So in this example:
         bit  dog  shoe
File 1     3    3     0
File 2     4    0     0
File 3    19    4     0
File 4     0    5     0
Search     1    3     5
sim(1,S) = ((3 * 1) + (3 * 3) + (0 * 5)) / (sqrt(3^2 + 3^2 + 0^2) * sqrt(1^2 + 3^2 + 5^2)) = 0.478
The scalar (dot) product of the file vector and the search vector is taken, then divided by the product of their magnitudes.
This is done between the other 3 files and the search and stored in a list.
The list is then returned in order most relevant to least.
sim(2,S) = ((4 * 1) + (0 * 3) + (0 * 5)) / (sqrt(4^2 + 0^2 + 0^2) * sqrt(1^2 + 3^2 + 5^2)) = 0.169
sim(3,S) = ((19 * 1) + (4 * 3) + (0 * 5)) / (sqrt(19^2 + 4^2 + 0^2) * sqrt(1^2 + 3^2 + 5^2)) = 0.26987
sim(4,S) = ((0 * 1) + (5 * 3) + (0 * 5)) / (sqrt(0^2 + 5^2 + 0^2) * sqrt(1^2 + 3^2 + 5^2)) = 0.507
Therefore [4,1,3,2] should be returned
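As a sanity check on the hand calculation above, here is a minimal sketch that recomputes the four similarities from the table and sorts the files. The names files, magnitude, cosine, sims and ranking are illustrative only and are not part of the posted code. It is written with .items() and print() so it also runs on Python 3, and, like the hand calculation, it builds each file's magnitude from the three search terms only (the posted code would also include 'red').

```python
import math

# Term counts per file for the three search terms, copied from the table above.
files = {
    1: {'bit': 3, 'dog': 3, 'shoe': 0},
    2: {'bit': 4, 'dog': 0, 'shoe': 0},
    3: {'bit': 19, 'dog': 4, 'shoe': 0},
    4: {'bit': 0, 'dog': 5, 'shoe': 0},
}
search = {'bit': 1, 'dog': 3, 'shoe': 5}

def magnitude(vec):
    # Euclidean length: square root of the sum of squared counts.
    return math.sqrt(sum(v ** 2 for v in vec.values()))

def cosine(file_vec, query):
    # Dot product over the query terms, divided by the product of magnitudes.
    dot = sum(file_vec.get(term, 0) * weight for term, weight in query.items())
    return dot / (magnitude(file_vec) * magnitude(query))

sims = {num: cosine(vec, search) for num, vec in files.items()}

# File numbers ordered from most to least similar to the search.
ranking = sorted(sims, key=sims.get, reverse=True)
print(ranking)  # [4, 1, 3, 2]
```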
Upvotes: 1
Views: 501
Reputation: 2396
sizeFileVec = {}
for word, innerDict in wordFrequency.iteritems():
    for fileNum, appearances in innerDict.iteritems():
        sizeFileVec[fileNum] += appearances ** 2
This fails because the key doesn't exist yet, so Python has no existing value to add appearances ** 2 to.
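A tiny reproduction may make this clearer. This is only a sketch with a made-up dictionary name (size_file_vec) and Python 3 print syntax, not the poster's code: += has to read the old value first, and that read is what raises.

```python
# Hypothetical minimal reproduction of the error.
size_file_vec = {}

try:
    # x[k] += v is read-then-write: it looks up size_file_vec[1] first,
    # and that lookup fails because key 1 has never been assigned.
    size_file_vec[1] += 3 ** 2
except KeyError as err:
    missing_key = err.args[0]
    print("KeyError for key:", missing_key)

# Once the key has a starting value, the same statement works.
size_file_vec[1] = 0
size_file_vec[1] += 3 ** 2
print(size_file_vec)  # {1: 9}
```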
You could do something like,
sizeFileVec = {}
for word, innerDict in wordFrequency.iteritems():
    for fileNum, appearances in innerDict.iteritems():
        if fileNum not in sizeFileVec:
            sizeFileVec[fileNum] = 0  # your default value
        sizeFileVec[fileNum] += appearances ** 2
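(or use the dict method setdefault for the same effect). The file_relevancy[fileNum] += ... statement in line 18 uses the same pattern, but file_relevancy is a Counter, which treats missing keys as 0, so that line does not raise a KeyError.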
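For reference, here is a rough sketch of the setdefault variant, plus collections.defaultdict as another common option. It reuses the wordFrequency data from the question; sizeFileVec2 and magnitudes are names I made up for the example, and it uses .items() so it runs on Python 3 as well (on Python 2 you could keep .iteritems()).

```python
import math
from collections import defaultdict

wordFrequency = {
    'bit': {1: 3, 2: 4, 3: 19, 4: 0},
    'red': {1: 0, 2: 0, 3: 15, 4: 0},
    'dog': {1: 3, 2: 0, 3: 4, 4: 5},
}

# Option 1: setdefault returns the stored value, inserting 0 if the key is new.
sizeFileVec = {}
for word, innerDict in wordFrequency.items():
    for fileNum, appearances in innerDict.items():
        sizeFileVec[fileNum] = sizeFileVec.setdefault(fileNum, 0) + appearances ** 2

# Option 2: defaultdict(int) creates missing keys with value 0 on first access,
# so the original += line works unchanged.
sizeFileVec2 = defaultdict(int)
for word, innerDict in wordFrequency.items():
    for fileNum, appearances in innerDict.items():
        sizeFileVec2[fileNum] += appearances ** 2

# Square root of each sum of squares gives the per-file vector length.
magnitudes = {fileNum: math.sqrt(total) for fileNum, total in sizeFileVec.items()}
print(sizeFileVec)
```

Both options give the same sums of squares; which one reads better is mostly a matter of taste.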
Upvotes: 1