Reputation: 79
I have created the following class to implement an inverted index in Python. I read the questions from the Quora Question Pairs challenge. The questions are in this form:
---------------------------
qid |question
---------------------------
1 |Why do we exist?
2 |Is there life on Mars?
3 |What happens after death?
4 |Why are bananas yellow?
The problem is that I want the qid to be stored along with each word in the inverted index, so that once the index is built I know which question each word comes from and can access it easily.
from collections import defaultdict

class Index:
    """ Inverted index datastructure """
    def __init__(self, stemmer=None):
        self.stemmer = stemmer            # optional; lookup() checks for it
        self.index = defaultdict(list)    # word -> list of document ids
        self.documents = {}               # document id -> list of words
        self.__unique_id = 0
    def lookup(self, word):
        """
        Lookup a word in the index
        """
        word = word.lower()
        if self.stemmer:
            word = self.stemmer.stem(word)
        # default to [] so that an unknown word returns an empty list
        return [self.documents.get(doc_id) for doc_id in self.index.get(word, [])]
    def addProcessed(self, words):
        """
        Add a document string to the index
        """
        for word in words:
            if self.__unique_id not in self.index[word]:
                self.index[word].append(self.__unique_id)
        self.documents[self.__unique_id] = words
        self.__unique_id += 1
How could I implement this in my above data structure?
Upvotes: 1
Views: 206
Reputation: 38982
A straightforward way to get qid into your index is to have Index.addProcessed receive qid as a second argument and include it in the value stored under the unique_id key in documents.
def addProcessed(self, words, qid):
    # ...
    self.documents[self.__unique_id] = (words, qid)
    self.__unique_id += 1
Index.lookup will then return a list of (words, qid) tuples, one for each question that contains the word.
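For reference, here is a minimal end-to-end sketch of that change. It assumes the questions are already tokenized into lowercase words with punctuation removed, and it leaves out the stemmer for brevity; the sample tokens and the idx and qids names are only for illustration.

```python
from collections import defaultdict


class Index:
    """Inverted index that keeps the qid of every indexed question."""

    def __init__(self):
        self.index = defaultdict(list)   # word -> list of internal document ids
        self.documents = {}              # internal document id -> (words, qid)
        self.__unique_id = 0

    def addProcessed(self, words, qid):
        """Index a tokenized question and store its qid next to the words."""
        for word in words:
            if self.__unique_id not in self.index[word]:
                self.index[word].append(self.__unique_id)
        self.documents[self.__unique_id] = (words, qid)
        self.__unique_id += 1

    def lookup(self, word):
        """Return a (words, qid) tuple for every question containing the word."""
        # .get() with a default does not create entries for unknown words
        doc_ids = self.index.get(word.lower(), [])
        return [self.documents[doc_id] for doc_id in doc_ids]


# Assumed pre-tokenized input (lowercased, punctuation stripped).
idx = Index()
idx.addProcessed(["why", "do", "we", "exist"], 1)
idx.addProcessed(["why", "are", "bananas", "yellow"], 4)

# Collect only the question ids for a word.
qids = [qid for _words, qid in idx.lookup("why")]
print(qids)  # [1, 4]
```

If only the question ids are ever needed, another option would be to use qid itself as the key in documents instead of a separate counter, provided every qid is unique.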
Upvotes: 1