Reputation: 23
I have a file that consists of numbers (document IDs) and text (the documents):
1000 The world`s end
1001 This is fine
I need to create a term dictionary and a postings list. The term dictionary represents the documents split into terms, with each term paired with its document ID. I'm guessing the term dictionary should be (key: term, value: document_id), like this:
the = 1000
world`s = 1000
end = 1000
this = 1001
is = 1001
fine = 1001
The postings list shows which documents each term occurs in. It should look like this:
This = 1000 1001
the = 1000 1001
first = 1000
So far I have only managed to split the documents into terms (I don't even know if I did it right). What is the next step, and how do I do it?
import codecs
import re

# Open and read the documents file
docLine = codecs.open('sample.txt', 'r', 'utf8').read().splitlines()
# Empty dictionary
doc_dictionary = {}
# Split every line into id (key) and document text (val) and save them in the dictionary
for document in docLine:
    (key, val) = re.split(r'\t+', document)
    doc_dictionary[key] = val
print("Documents")
print(doc_dictionary)
# Split documents into words (terms)
print("")
print("Words")
words = {key: [val for val in value.split()] for key, value in doc_dictionary.items()}
print(words)
Documents {
'1000': 'The Project Gutenberg EBook of Pride and Prejudice, by Jane Austen',
'1001': 'This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org', etc.
Words {
'1000': ['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Pride', 'and', 'Prejudice,', 'by', 'Jane', 'Austen'],
'1001': ['This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever.', 'You', 'may', 'copy', 'it,', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 'the', 'terms', 'of', 'the', 'Project', 'Gutenberg', 'License', 'included', 'with', 'this', 'eBook', 'or', 'online', 'at', 'www.gutenberg.org'],
Upvotes: 0
Views: 198
Reputation: 1598
I would loop through the dictionary you created:
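result = {}
for key, terms in words.items():  # 'terms' avoids shadowing the built-in list
    for elem in terms:
        if elem in result:
            # Term already indexed: add the document id only once
            if key not in result[elem]:
                result[elem].append(key)
        else:
            # First time this term is seen
            result[elem] = [key]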
I tried it with
words = {'1000': ['the', 'world', 'the'],
         '1001': ['the', 'party']}
and the result:
{'the': ['1000', '1001'], 'world': ['1000'], 'party': ['1001']}
To search for a list of terms in the result dictionary, you can use this:
for word in to_find:
    if word in result:
        print(word + ': ' + " ".join(result[word]))
    else:
        print(word + ': not found in dict')
An example input, to_find = ['the', 'party', 'car'], gives this output:
the: 1000 1001
party: 1001
car: not found in dict
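As a variation on the loop above, here is a minimal sketch that builds both structures the question asks for: a list of (term, document id) pairs and a postings dictionary. The helper name build_index and the lowercasing step are my own assumptions for illustration, not part of the original code.

```python
def build_index(words):
    """Return (term_pairs, postings) for a {doc_id: [terms]} mapping."""
    term_pairs = []   # one (term, doc_id) pair per occurrence
    postings = {}     # term -> list of doc ids, without duplicates
    for doc_id, terms in words.items():
        for term in terms:
            term = term.lower()  # assumption: treat 'The' and 'the' as one term
            term_pairs.append((term, doc_id))
            # setdefault returns the existing list or inserts a new empty one
            ids = postings.setdefault(term, [])
            if doc_id not in ids:
                ids.append(doc_id)
    return term_pairs, postings


words = {'1000': ['The', 'world', 'the'],
         '1001': ['the', 'party']}
term_pairs, postings = build_index(words)
print(term_pairs)
print(postings)
```

Whether to lowercase (or strip punctuation) depends on how you want searches to match, so treat that line as optional.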
Upvotes: 1
Reputation: 8077
From your question, it seems you are trying to swap the keys and values of the newly generated dict. This is called indexing (the result is often called an inverted index); it is what you see at the back of books and how search engines deliver results fast.
Instead of creating multiple dictionaries, you can build the index in a single pass:
import re
from collections import defaultdict

def normalize(line, pattern=re.compile(r"\W*\s+\W*")):
    # Strip punctuation from the ends of the line, split on whitespace
    # (dropping adjacent non-word characters), and lowercase each word
    return map(str.lower, pattern.split(line.strip(".!+,")))

index = defaultdict(set)
for document in docLine:
    key, value = re.split(r'\t+', document, maxsplit=1)  # Split line into key and text parts
    for word in normalize(value):  # Normalize words to be used as index
        index[word].add(key)  # Add key to word's set
{'almost': {'1001'}, 'and': {'1001', '1000'}, 'anyone': {'1001'}, 'anywhere': {'1001'},
 'at': {'1001'}, 'austen': {'1000'}, 'away': {'1001'}, 'by': {'1000'}, 'copy': {'1001'},
 'cost': {'1001'}, 'ebook': {'1001', '1000'}, 'for': {'1001'}, 'give': {'1001'},
 'gutenberg': {'1001', '1000'}, 'included': {'1001'}, 'is': {'1001'}, 'it': {'1001'},
 'jane': {'1000'}, 'license': {'1001'}, 'may': {'1001'}, 'no': {'1001'},
 'of': {'1001', '1000'}, 'online': {'1001'}, 'or': {'1001'}, 'prejudice': {'1000'},
 'pride': {'1000'}, 'project': {'1001', '1000'}, 're-use': {'1001'},
 'restrictions': {'1001'}, 'terms': {'1001'}, 'the': {'1001', '1000'}, 'this': {'1001'},
 'under': {'1001'}, 'use': {'1001'}, 'whatsoever': {'1001'}, 'with': {'1001'},
 'www.gutenberg.org': {'1001'},  # Notice no trailing period.
 'you': {'1001'}}
Please see my Repl for a complete example.
This uses defaultdict to set up the main dictionary, which ensures that the value created for every new key is of a specific type (in this case, a set).
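To illustrate the defaultdict behaviour, here is a small standalone sketch (the sample terms and ids are made up for illustration). One thing to watch for: reading a missing key with square brackets also creates it, so for lookups it is probably safer to use in or .get().

```python
from collections import defaultdict

index = defaultdict(set)

# A missing key is created on first access with an empty set,
# so .add() works without checking whether the term exists yet.
index['the'].add('1000')
index['the'].add('1001')
index['the'].add('1000')   # sets ignore duplicates

print(sorted(index['the']))      # ['1000', '1001']

# Lookups with [] would silently insert the key; .get() and 'in' do not.
print(index.get('car', set()))   # set()
print('car' in index)            # False
```

Sets are unordered, which is why the ids are sorted before printing here.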
Upvotes: 1