simitar
simitar

Reputation: 3

comparing python dictionary values

I'm creating a very basic search engine in python, I am working on creating a method for handling phrase queries, so if the position of 2 words are within 1 they are next to each other in the document and it will output all document numbers where this happens.

I currently have a dictionary which looks like this

{'8':[['1170', '1264', '1307', '1559', '1638'], ['197', '1169']], 
 '6':[['345', '772'], ['346']}

This is just a layout example.

w=word, p=position ||
{doc1:[w1p1, w1p2, w1p3],[w2p1, w2p2]}

The key is the document id, followed by the positions in that document that the 1st word contains, then the positions of the 2nd word. There will be as many words (grouping of the positions) as that of in the query.

My questions is, is there a way were i can compare the values of the 1 and 2nd + 3rd etc ... values for the same document ID?. I want to compare them to see if a words position is only +1 of the other word.

So you can see for doc 6 word 2 follows word 1, this would result in the key being sent back.

Upvotes: 0

Views: 170

Answers (1)

Gimson
Gimson

Reputation: 66

There are a couple ways to achieve what you're trying to do here. I'm assuming based on the example you gave me that there are always only two words, and the lists are always ordered ordered.

No matter what the method, you'll want to iterate over the documents (The dictionary). Iterating over dictionaries is simple in Python; you can see an example here. After that, the steps change

First option - less efficient, slightly simpler:

  1. Iterate over each item (location) in list 1 (the locations of the first word).
  2. Iterate over each item (location) in list 2 (the locations of the second word).
  3. Compare the two locations, and if they're within 1 of each other, return the document id.

    Example:

    for documentNumber in docdictionary:
        for word1location in docdictionary[documentNumber][0]:
            for word2location in docdictionary[documentNumber][1]:
                if abs(word1location - word2location) == 1:
                    return documentNumber
    

Second Option - more efficient, slightly more complicated:

  1. Start at the beginning of each list of word locations, keeping track of where you are
  2. Check the two values at the locations you're at.
    • If the two values are 1 word apart, return the document number
    • If the two values are not, check which list item (page position), has a lower value and move to the next item in that list, repeat
  3. If one of the lists (ex. list 1) runs out of numbers, and the other list (list 2) is at a value that is greater than the last value of the first (list 1), return None.

    Example:

    for documentNumber in docdictionary:
        list1pos = 0
        list2pos = 0
        while True:
            difference = docdictionary[documentNumber][0][list1pos] - docdictionary[documentNumber][1][list2pos]
            if abs(difference) == 1:
                return documentNumber
            if difference < 0: #Page location 2 is greater
                list1pos++
                if list1pos == len(docdictionary[documentNumber][0]): #We were at the end of list 1, there will be no more matches
                    break
            else: #Page location 1 is greater
                list2pos++
                if list2pos == len(docdictionary[documentNumber][1]): #We were at the end of list 2, there will be no more matches
                    break
    return None
    

As a reminder, option 2 only works if the lists are always sorted. Also, you don't always need to return the document id right away. You could just add the document id to a list if you want all the documents that the pair happens in instead of the first one it finds. You could even use a dictionary to easily keep track of how many times the word pair appears in each document.

Hope this helped! Please let me know if anything isn't clear.

Upvotes: 1

Related Questions