Morningshade

Reputation: 23

Python: Split dictionary values into terms and make dictionary out of it

I have a file in which each line consists of a number (the document ID) and text (the document):

1000 The world's end

1001 This is fine

I need to create a term dictionary and a postings list. The term dictionary represents the documents, split into terms and paired with the document ID. I'm guessing it should be (key: term, value: document_id), like this:

the = 1000

world's = 1000

end = 1000

this = 1001

is = 1001

fine = 1001

The postings list shows which documents each term occurs in. It should look like this:

This = 1000 1001

the = 1000 1001

first = 1000

I have only managed to split the documents into terms (I don't even know if I did it right). What should the next step be, and how do I do it?

Python code

import codecs
import re

#Open and read documents file
docLine = codecs.open('sample.txt', 'r', 'utf8').read().splitlines()

#Empty dictionary
doc_dictionary = {}

#Split every line in id (keys) and documents (val) to save as dictionary
for document in docLine:
    (key, val) = re.split(r'\t+', document)
    doc_dictionary[key] = val
print("Documents")
print(doc_dictionary)

#Splits documents into words (terms)
print("") 
print("Words")
words = {key: value.split() for key, value in doc_dictionary.items()}
print(words)

Result

Documents {

'1000': 'The Project Gutenberg EBook of Pride and Prejudice, by Jane Austen',

'1001': 'This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org', etc.

Words {

'1000': ['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Pride', 'and', 'Prejudice,', 'by', 'Jane', 'Austen'],

'1001': ['This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever.', 'You', 'may', 'copy', 'it,', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 'the', 'terms', 'of', 'the', 'Project', 'Gutenberg', 'License', 'included', 'with', 'this', 'eBook', 'or', 'online', 'at', 'www.gutenberg.org'],

Upvotes: 0

Views: 198

Answers (2)

Gamopo

Reputation: 1598

I would loop through the dictionary you created:

result = {}
for key, terms in words.items():   # 'terms' avoids shadowing the built-in list
    for elem in terms:
        if elem in result:
            if key not in result[elem]:
                result[elem].append(key)
        else:
            result[elem] = [key]

I tried it with

words = {'1000': ['the', 'world', 'the'],
         '1001': ['the', 'party']}

and the result:

{'the': ['1000', '1001'], 'world': ['1000'], 'party': ['1001']}
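
As an aside, the same loop can be written a little more compactly with dict.setdefault. This is only a sketch, assuming words has the same shape as the test input above; the function name build_postings is made up for illustration:

```python
def build_postings(words):
    """Map each term to the list of document ids that contain it."""
    result = {}
    for doc_id, terms in words.items():
        for term in terms:
            # setdefault returns the existing list, or stores and returns a new one
            postings = result.setdefault(term, [])
            if doc_id not in postings:
                postings.append(doc_id)
    return result


words = {'1000': ['the', 'world', 'the'],
         '1001': ['the', 'party']}
postings = build_postings(words)
print(postings)
```

On Python 3.7+ (where dicts keep insertion order) this should print the same dictionary as above.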

To search for a list of terms in the result dictionary, you can use this:

for word in to_find:
    if word in result:
        print(word + ': ' + " ".join(result[word]))
    else:
        print(word + ': not found in dict')

For example, the input to_find = ['the', 'party', 'car'] gives this output:

the: 1000 1001

party: 1001

car: not found in dict

Upvotes: 1

Sunny Patel

Reputation: 8077

From your question, it seems like you are trying to swap the keys and values of the newly generated dict. This is called indexing (more precisely, building an inverted index); it is what you see at the back of books and how search engines deliver results fast.

Instead of creating multiple dictionaries, you can build the index in a single pass:

import re
from collections import defaultdict

def normalize(line, pattern=re.compile(r"\W*\s+\W*")):
    # Strip punctuation from the ends of the line, split on whitespace
    # (dropping adjacent non-word characters), and lowercase each term
    return map(str.lower, pattern.split(line.strip(".!+,")))

index = defaultdict(set)
for document in docLine:
    key, value = re.split(r'\t+', document, 1)  # Split line into key and text parts
    for word in normalize(value):               # Normalize words to be used as index
        index[word].add(key)                    # Add key to word's set

Output

{'almost': {'1001'},
 'and': {'1001', '1000'},
 'anyone': {'1001'},
 'anywhere': {'1001'},
 'at': {'1001'},
 'austen': {'1000'},
 'away': {'1001'},
 'by': {'1000'},
 'copy': {'1001'},
 'cost': {'1001'},
 'ebook': {'1001', '1000'},
 'for': {'1001'},
 'give': {'1001'},
 'gutenberg': {'1001', '1000'},
 'included': {'1001'},
 'is': {'1001'},
 'it': {'1001'},
 'jane': {'1000'},
 'license': {'1001'},
 'may': {'1001'},
 'no': {'1001'},
 'of': {'1001', '1000'},
 'online': {'1001'},
 'or': {'1001'},
 'prejudice': {'1000'},
 'pride': {'1000'},
 'project': {'1001', '1000'},
 're-use': {'1001'},
 'restrictions': {'1001'},
 'terms': {'1001'},
 'the': {'1001', '1000'},
 'this': {'1001'},
 'under': {'1001'},
 'use': {'1001'},
 'whatsoever': {'1001'},
 'with': {'1001'},
 'www.gutenberg.org': {'1001'},     # Notice no trailing period.
 'you': {'1001'}}
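
If you want the postings printed in the "term = ids" layout from the question, a sketch along these lines might do. The small index below is a hand-made stand-in for the real one, and the sorting is my own choice to make the output stable, since sets have no fixed order:

```python
# Hypothetical miniature index, standing in for the one built above
index = {'the': {'1001', '1000'}, 'austen': {'1000'}}

# One "term = id id ..." line per term, with terms and ids both sorted
lines = [f"{term} = {' '.join(sorted(ids))}" for term, ids in sorted(index.items())]
for line in lines:
    print(line)
```

For this stand-in it prints "austen = 1000" followed by "the = 1000 1001".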

Please see my Repl for a complete example.

This uses defaultdict to set up the main dictionary, which ensures that every new key starts with a value of a specific type (in this case, an empty set).
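
To illustrate that behaviour in isolation, here is a minimal sketch with made-up entries (not taken from the sample file):

```python
from collections import defaultdict

index = defaultdict(set)
index['the'].add('1000')   # 'the' is missing, so an empty set is created first
index['the'].add('1001')
index['the'].add('1000')   # sets silently ignore duplicates

print(sorted(index['the']))   # ['1000', '1001']
print('car' in index)         # False: a membership test does not create a key
```

Note that looking a key up with index['car'] would create an empty set for it, so use "in" (or index.get) when you only want to check for a term.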

Upvotes: 1
