Python 3 Dictionary for Weighted Inverted Index

Question

First, this is homework, so I would just like suggestions, please. I am writing a program that generates a weighted inverted index. The weighted inverted index is a dictionary with a word as a key; the value is a list of lists, with each item in the list containing the document number, and the number of times that word appears in the document.

For example,

{"a": [[1, 2],[2,1]]}
The word "a" appears twice in document 1 and once in document 2.

I am practicing with two small files.

file1.txt:

    Where should I go
    When I want to have
    A smoke,
    A pancake, 
    and a nap.

file2.txt:

    I do not know
    Where my pancake is
    I want to take a nap.

Here is my program code:

def cleanData(myFile):
    file = open(myFile, "r")

    data = file.read()
    wordList = []

    #All numbers and end-of-sentence punctuation
    #replaced with the empty string
    #No replacement of apostrophes
    formattedData = data.strip().lower().replace(",","")\
                 .replace(".","").replace("!","").replace("?","")\
                 .replace(";","").replace(":","").replace('"',"")\
                 .replace("1","").replace("2","").replace("3","")\
                 .replace("4","").replace("5","").replace("6","")\
                 .replace("7","").replace("8","").replace("9","")\
                 .replace("0","")

    words = formattedData.split() #creates a list of all words in the document
    for word in words:
        wordList.append(word)     #adds each word in a document to the word list
    return wordList

def main():

    fullDict = {}

    files = ["file1.txt", "file2.txt"]
    docNumber = 1

    for file in files:
        wordList = cleanData(file)

        for word in wordList:
            if word not in fullDict:
                fullDict[word] = []
                fileList = [docNumber, 1]
                fullDict[word].append(fileList)
            else:
                listOfValues = list(fullDict.values())
                for x in range(len(listOfValues)):
                    if docNumber == listOfValues[x][0]:
                        listOfValues[x][1] +=1
                        fullDict[word] = listOfValues
                        break
                fileList = [docNumber,1]
                fullDict[word].append(fileList)

        docNumber +=1
    return fullDict

What I am trying to do is generate something like this:

{"a": [[1,3],[2,1]], "nap": [[1,1],[2,1]]}

What I am getting is this:

{"a": [[1,1],[1,1],[1,1],[2,1]], "nap": [[1,1],[2,1]]}

It records all occurrences of each word in all documents, but it records repeats separately. I cannot figure this out. Any help would be appreciated! Thank you in advance. :)

thefourtheye · Accepted Answer

There are two main problems in your code.

Problem 1

        listOfValues = list(fullDict.values())
        for x in range(len(listOfValues)):
            if docNumber == listOfValues[x][0]:

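Here, you take all the values of the dictionary, regardless of the current word, and try to increment the count in them. You should instead be incrementing the count in the list that corresponds to the current word. As a side effect, listOfValues[x][0] is then itself a [docNumber, count] pair rather than a document number, so the comparison with docNumber never succeeds and nothing is ever incremented. So, you should change it to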

listOfValues = fullDict[word]
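
To see the difference, here is a small standalone sketch with a made-up two-word dictionary (the contents are only for illustration):

```python
fullDict = {"a": [[1, 2]], "nap": [[1, 1]]}

# What the original code does: collect the lists of every word
all_values = list(fullDict.values())
print(all_values[0][0])        # [1, 2]  (a whole [docNumber, count] pair)
print(1 == all_values[0][0])   # False   (an int never equals a list)

# What it should do: look only at the current word's list
postings = fullDict["a"]
print(postings[0][0])          # 1       (the document number)
print(1 == postings[0][0])     # True
```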

Problem 2

        fileList = [docNumber,1]
        fullDict[word].append(fileList)

Apart from the counting problem, you always append a new [docNumber, 1] entry to fullDict[word], even when that document is already listed. You should append it only if docNumber is not already present in listOfValues. For that, you can use an else clause on the for loop, which runs only when the loop finishes without hitting break, like this

    for word in wordList:
        if word not in fullDict:
            ....
        else:
            listOfValues = fullDict[word]
            for x in range(len(listOfValues)):
                ....
            else:
                fileList = [docNumber, 1]
                fullDict[word].append(fileList)
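
If the for ... else form is unfamiliar, here is a minimal standalone sketch of how it behaves. The helper name `add_occurrence` is hypothetical (it is not part of your code); it works on one word's list of [docNumber, count] pairs:

```python
def add_occurrence(postings, doc_number):
    """Record one more occurrence in doc_number (hypothetical helper)."""
    for pair in postings:
        if pair[0] == doc_number:
            pair[1] += 1      # document already listed: bump its count
            break             # break skips the else clause below
    else:
        # Reached only if the loop never hit break,
        # i.e. doc_number was not in postings yet
        postings.append([doc_number, 1])
    return postings

postings = [[1, 2]]
add_occurrence(postings, 1)   # existing document
add_occurrence(postings, 2)   # new document
print(postings)               # [[1, 3], [2, 1]]
```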

After making these two changes, I got the following output

{'a': [[1, 3], [2, 1]],
 'and': [[1, 1]],
 'do': [[2, 1]],
 'go': [[1, 1]],
 'have': [[1, 1]],
 'i': [[1, 2], [2, 2]],
 'is': [[2, 1]],
 'know': [[2, 1]],
 'my': [[2, 1]],
 'nap': [[1, 1], [2, 1]],
 'not': [[2, 1]],
 'pancake': [[1, 1], [2, 1]],
 'should': [[1, 1]],
 'smoke': [[1, 1]],
 'take': [[2, 1]],
 'to': [[1, 1], [2, 1]],
 'want': [[1, 1], [2, 1]],
 'when': [[1, 1]],
 'where': [[1, 1], [2, 1]]}

Here are a few suggestions to improve your code.

Instead of using lists to store the document number and the count, you can actually use a dictionary. That would make your life easier.
Instead of counting manually, you can use collections.Counter.
Instead of using multiple replaces, you can use a simple regular expression, like this
```
formattedData = re.sub(r'[,.!?;:"0-9]', '', data.strip().lower())
```

If I were to clean up cleanData, I would do it like this

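import re

def cleanData(myFile):
    # The with-block closes the file automatically
    with open(myFile, "r") as input_file:
        data = input_file.read()
    # Drop punctuation (including commas) and digits, then split on whitespace
    return re.sub(r'[,.!?;:"0-9]', '', data.strip().lower()).split()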

In the main loop, you can use the improvements suggested by Brad Budlong, like this

def main():
    fullDict = {}
    files = ["file1.txt", "file2.txt"]
    for docNumber, currentFile in enumerate(files, 1):
        for word in cleanData(currentFile):
            if word not in fullDict:
                fullDict[word] = [[docNumber, 1]]
            else:
                for x in fullDict[word]:
                    if docNumber == x[0]:
                        x[1] += 1
                        break
                else:
                    fullDict[word].append([docNumber, 1])
    return fullDict
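
The first two suggestions (a dictionary per word instead of a list of pairs, and collections.Counter for the counting) could be combined roughly as follows. This is only a sketch: `build_index` is a name I made up, and it takes already-cleaned word lists rather than file names so that it runs without the files.

```python
from collections import Counter

def build_index(documents):
    """Map each word to {docNumber: count}.

    documents is a list of word lists, one per document
    (for example, the result of cleanData for each file).
    """
    index = {}
    for doc_number, words in enumerate(documents, 1):
        # Counter tallies every word of one document in a single pass
        for word, count in Counter(words).items():
            index.setdefault(word, {})[doc_number] = count
    return index

docs = [
    "where should i go when i want to have a smoke a pancake and a nap".split(),
    "i do not know where my pancake is i want to take a nap".split(),
]
index = build_index(docs)
print(index["a"])    # {1: 3, 2: 1}
print(index["nap"])  # {1: 1, 2: 1}
```

If the assignment requires the list-of-lists format, an entry can be converted back with something like `[[d, c] for d, c in index[word].items()]`.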
