user3247054
user3247054

Reputation: 691

Build a dictionary from list of lists

I am trying to build an inverted index, i.e. map a text to the document it came from. It's position within the list/document.

In my case i have parsed list containing lists(i.e list of lists).

My input is like this.

        [
        ['why', 'was', 'cinderella', 'late', 'for', 'the', 'ball', 'she', 'forgot', 'to', 'swing', 'the', 'bat'],
        ['why', 'is', 'the', 'little', 'duck', 'always', 'so', 'sad', 'because', 'he', 'always', 'sees', 'a', 'bill', 'in', 'front', 'of', 'his', 'face'],
        ['what', 'has', 'four', 'legs', 'and', 'goes', 'booo', 'a', 'cow', 'with', 'a', 'cold'], 
        ['what', 'is', 'a', 'caterpillar', 'afraid', 'of', 'a', 'dogerpillar'],
        ['what', 'did', 'the', 'crop', 'say', 'to', 'the', 'farmer', 'why', 'are', 'you', 'always', 'picking', 'on', 'me']
        ]

This is my code

def create_inverted(mylists):
    myDict = {}
    for sublist in mylists: 
        for i in range(len(sublist)):
            if sublist[i] in myDict:
                myDict[sublist[i]].append(i)
            else:
                myDict[sublist[i]] = [i]

    return myDict

It does build the dictionary, but when i do a search i am not getting the correct result. I am trying to do something like this.

documents = [['owl', 'lion'], ['lion', 'deer'], ['owl', 'leopard']]

index = {'owl': [0, 2],
         'lion': [0, 1],  # IDs are sorted. 
         'deer': [1],
         'leopard': [2]}

def indexed_search(documents, index, query):
   return [documents[doc_id] for doc_id in index[query]]

print indexed_search(documents, index, 'lion')

Where i can enter search text and it gets the list ids.

Any Ideas.

Upvotes: 0

Views: 154

Answers (4)

user2357112
user2357112

Reputation: 280311

You're mapping each word to the positions it was found in in each document, not which document it was found in. You should store indexes into the list of documents instead of indexes into the documents themselves, or perhaps just map words to documents directly instead of to indices:

def create_inverted_index(documents):
    index = {}
    for i, document in enumerate(documents):
        for word in set(document):
            if word in index:
                index[word].append(i)
            else:
                index[word] = [i]
    return index

Most of this is the same as your code. The main differences are in these two lines:

    for i, document in enumerate(documents):
        for word in set(document):

which correspond to the following part of your code:

    for sublist in mylists: 
        for i in range(len(sublist)):

enumerate iterates over the indices and elements of a sequence. Since enumerate is on the outer loop, i in my code is the index of the document, while i in your code is the index of a word within a document.

set(document) creates a set of the words in the document, where each word appears only once. This ensures that each word is only counted once per document, rather than having 10 occurrences of 2 in the list for 'Cheetos' if 'Cheetos' appears in document 2 10 times.

Upvotes: 1

John La Rooy
John La Rooy

Reputation: 304137

I'd accumulate the indices into a set to avoid duplicates and then sort

>>> documents = [['owl', 'lion'], ['lion', 'deer'], ['owl', 'leopard']]
>>> from collections import defaultdict
>>> D = defaultdict(set)
>>> for i, doc in enumerate(documents):
...     for word in doc:
...         D[word].add(i)
... 
>>> D ## Take a look at the defaultdict
defaultdict(<class 'set'>, {'owl': {0, 2}, 'leopard': {2}, 'lion': {0, 1}, 'deer': {1}})
>>> {k:sorted(v) for k,v in D.items()}
{'lion': [0, 1], 'owl': [0, 2], 'leopard': [2], 'deer': [1]}

Upvotes: 0

michaelmeyer
michaelmeyer

Reputation: 8205

This is straightforward as long you don't need efficient code:

documents = [['owl', 'lion'], ['lion', 'deer'], ['owl', 'leopard']]

def index(docs):
    doc_index = {}
    for doc_id, doc in enumerate(docs, 1):
        for term_pos, term in enumerate(doc, 1):
            doc_index.setdefault(term, {}).setdefault(doc_id, []).append(term_pos)
    return doc_index

Now you get a two-level dictionary giving you access to the document ids, and then to the positions of the terms in this document:

>>> index(documents)
{'lion': {1: [2], 2: [1]}, 'leopard': {3: [2]}, 'deer': {2: [2]}, 'owl': {1: [1], 3: [1]}}

This is only a preliminary step for indexing; afterwards, you need to separate the term dictionary from the document postings from the positions postings. Typically, the dictionary is stored in a tree-like structures (there are Python packages for this), and the document postings and positions postings are represented as arrays of unsigned integers.

Upvotes: 0

koffein
koffein

Reputation: 1882

At first I would extract all possible words and store them in one set. Then I look up each word in each list and collect all the indexes of lists the word happens to be in...

source = [
['why', 'was', 'cinderella', 'late', 'for', 'the', 'ball', 'she', 'forgot', 'to', 'swing', 'the', 'bat'],
['why', 'is', 'the', 'little', 'duck', 'always', 'so', 'sad', 'because', 'he', 'always', 'sees', 'a', 'bill', 'in', 'front', 'of', 'his', 'face'],
['what', 'has', 'four', 'legs', 'and', 'goes', 'booo', 'a', 'cow', 'with', 'a', 'cold'], 
['what', 'is', 'a', 'caterpillar', 'afraid', 'of', 'a', 'dogerpillar'],
['what', 'did', 'the', 'crop', 'say', 'to', 'the', 'farmer', 'why', 'are', 'you', 'always', 'picking', 'on', 'me']
]

allWords = set(word for lst in source for word in lst)

wordDict = { word: [
                    i for i, lst in enumerate(source) if word in lst
                    ] for word in allWords }

print wordDict
Out[30]: 
{'a': [1, 2, 3],
 'afraid': [3],
 'always': [1, 4],
 'and': [2],
 ...

Upvotes: 0

Related Questions