Reputation: 3
I'm creating a very basic search engine in python, I am working on creating a method for handling phrase queries, so if the position of 2 words are within 1 they are next to each other in the document and it will output all document numbers where this happens.
I currently have a dictionary which looks like this
{'8':[['1170', '1264', '1307', '1559', '1638'], ['197', '1169']],
'6':[['345', '772'], ['346']}
This is just a layout example.
w=word, p=position ||
{doc1:[w1p1, w1p2, w1p3],[w2p1, w2p2]}
The key is the document id, followed by the positions in that document that the 1st word contains, then the positions of the 2nd word. There will be as many words (grouping of the positions) as that of in the query.
My questions is, is there a way were i can compare the values of the 1 and 2nd + 3rd etc ... values for the same document ID?. I want to compare them to see if a words position is only +1 of the other word.
So you can see for doc 6 word 2 follows word 1, this would result in the key being sent back.
Upvotes: 0
Views: 170
Reputation: 66
There are a couple ways to achieve what you're trying to do here. I'm assuming based on the example you gave me that there are always only two words, and the lists are always ordered ordered.
No matter what the method, you'll want to iterate over the documents (The dictionary). Iterating over dictionaries is simple in Python; you can see an example here. After that, the steps change
First option - less efficient, slightly simpler:
Compare the two locations, and if they're within 1 of each other, return the document id.
Example:
for documentNumber in docdictionary:
for word1location in docdictionary[documentNumber][0]:
for word2location in docdictionary[documentNumber][1]:
if abs(word1location - word2location) == 1:
return documentNumber
Second Option - more efficient, slightly more complicated:
If one of the lists (ex. list 1) runs out of numbers, and the other list (list 2) is at a value that is greater than the last value of the first (list 1), return None.
Example:
for documentNumber in docdictionary:
list1pos = 0
list2pos = 0
while True:
difference = docdictionary[documentNumber][0][list1pos] - docdictionary[documentNumber][1][list2pos]
if abs(difference) == 1:
return documentNumber
if difference < 0: #Page location 2 is greater
list1pos++
if list1pos == len(docdictionary[documentNumber][0]): #We were at the end of list 1, there will be no more matches
break
else: #Page location 1 is greater
list2pos++
if list2pos == len(docdictionary[documentNumber][1]): #We were at the end of list 2, there will be no more matches
break
return None
As a reminder, option 2 only works if the lists are always sorted. Also, you don't always need to return the document id right away. You could just add the document id to a list if you want all the documents that the pair happens in instead of the first one it finds. You could even use a dictionary to easily keep track of how many times the word pair appears in each document.
Hope this helped! Please let me know if anything isn't clear.
Upvotes: 1