Python: Split dictionary values into terms and make dictionary out of it

Question

I have a file in which each line consists of a number (the document ID) and text (the document):

1000 The world's end

1001 This is fine

I need to create a term dictionary and a postings list. The term dictionary represents the documents split into terms, with each term paired with its document ID. I'm guessing the term dictionary should be (key: term, value: document_id), like this:

the = 1000

world's = 1000

end = 1000

this = 1001

is = 1001

fine = 1001

The postings list shows which documents each term appears in. It should look like this:

This = 1000 1001

the = 1000 1001

first = 1000

So far I have only managed to split the documents into terms (I don't even know if I did it right). What is the next step, and how do I do it?

Python code

import codecs
import re

# Open and read the documents file
docLine = codecs.open('sample.txt', 'r', 'utf8').read().splitlines()

#Empty dictionary
doc_dictionary = {}

# Split every line into id (key) and document (val) and store them in the dictionary
for document in docLine:
    # The id and the text are separated by one or more tabs
    (key, val) = re.split(r'\t+', document, maxsplit=1)
    doc_dictionary[key] = val
print("Documents")
print(doc_dictionary)

#Splits documents into words (terms)
print("") 
print("Words")
words = {key: value.split() for key, value in doc_dictionary.items()}
print(words)

Result

Documents {

'1000': 'The Project Gutenberg EBook of Pride and Prejudice, by Jane Austen',

'1001': 'This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org', etc.

Words {

'1000': ['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Pride', 'and', 'Prejudice,', 'by', 'Jane', 'Austen'],

'1001': ['This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever.', 'You', 'may', 'copy', 'it,', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 'the', 'terms', 'of', 'the', 'Project', 'Gutenberg', 'License', 'included', 'with', 'this', 'eBook', 'or', 'online', 'at', 'www.gutenberg.org'],

Gamopo · Accepted Answer

I would loop through the dictionary you created:

result = {}
for key, terms in words.items():
    for elem in terms:
        if elem in result:
            if key not in result[elem]:
                result[elem].append(key)
        else:
            result[elem] = [key]

I tried it with

words = {'1000': ['the', 'world', 'the'],
         '1001': ['the', 'party']}

and the result:

{'the': ['1000', '1001'], 'world': ['1000'], 'party': ['1001']}
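
As a possible variation, the same loop can be written with collections.defaultdict and sets, which removes the explicit membership checks. This is only a minimal sketch: the function name build_postings is made up for illustration, and it assumes words has the shape shown above ({doc_id: [term, ...]}). The postings are sorted at the end so the output is deterministic.

```python
from collections import defaultdict


def build_postings(words):
    """Map each term to the sorted list of document ids that contain it.

    `words` is assumed to have the shape {doc_id: [term, ...]}.
    """
    postings = defaultdict(set)
    for doc_id, terms in words.items():
        for term in terms:
            # A set ignores repeated terms within the same document.
            postings[term].add(doc_id)
    # Convert the sets to sorted lists for stable, printable output.
    return {term: sorted(doc_ids) for term, doc_ids in postings.items()}


words = {'1000': ['the', 'world', 'the'],
         '1001': ['the', 'party']}
result = build_postings(words)
print(result)
```

For the test input above this should produce the same dictionary as the explicit loop.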

To search for a list of terms in the result dictionary, you can use this:

for word in to_find:
    if word in result:
        print(word + ': ' + " ".join(result[word]))
    else:
        print(word + ': not found in dict')

An example input, to_find = ['the', 'party', 'car'], gives this output:

the: 1000 1001

party: 1001

car: not found in dict
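
One thing to watch: the terms expected in the question are lowercase (the, this), while a plain split() keeps capitalization and trailing punctuation ('The', 'Prejudice,'). A rough sketch of one way to normalize terms before indexing, and to produce the (term, document id) pairs described in the question, is below. The helper names tokenize and term_pairs, and the choice to strip only leading and trailing punctuation, are assumptions; the question does not specify how terms should be normalized.

```python
import string


def tokenize(text):
    """Lowercase the text and strip punctuation from the ends of each word."""
    terms = []
    for raw in text.split():
        # strip() only removes leading/trailing characters, so inner
        # punctuation as in "re-use" or "world's" is kept.
        term = raw.strip(string.punctuation).lower()
        if term:
            terms.append(term)
    return terms


def term_pairs(doc_dictionary):
    """Return a list of (term, doc_id) pairs, one per term occurrence."""
    return [(term, doc_id)
            for doc_id, text in doc_dictionary.items()
            for term in tokenize(text)]


# Sample input modeled on the two documents in the question.
doc_dictionary = {'1000': "The world's end", '1001': 'This is fine.'}
pairs = term_pairs(doc_dictionary)
print(pairs)
```

The lists returned by tokenize could replace value.split() when building words, so that 'The' and 'the' end up under the same key in the postings list.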
