HonsTh

Reputation: 65

Converting list to dictionary, and tokenizing the key values - possible?

So basically I have a folder of files that I'm opening and reading into Python.

I want to search these files and count the keywords in each file, to make a dataframe like the attached image.

I have managed to open and read these files into a list, but my problem is as follows:

Edit 1:

I decided to try importing the files as a dictionary instead. It works, but when I try to lower-case the values, I get a 'list' object attribute error, even though in my variable explorer it's defined as a dictionary.

import os
filenames = os.listdir('.')
file_dict = {}
for file in filenames:
    with open(file) as f:
        items = [i.strip() for i in f.read().split(",")]
    file_dict[file.replace(".txt", "")] = items

def lower_dict(d):
    new_dict = dict((k, v.lower()) for k, v in d.items())
    return new_dict
print(lower_dict(file_dict))


Output:

AttributeError: 'list' object has no attribute 'lower'

Pre-edit post:

1. Each list value doesn't retain the filename key, so I don't have the rows I need.

2. I can't search for keywords in the list anyway, because it isn't tokenized, so I can't count the keywords per file.

Here's my code for opening the files, converting them to lowercase and storing them in a list.

How can I transform this into a dictionary that retains the filename as the key and has tokenized values? Additionally, is it better to somehow import each file and its contents into a dictionary directly? Can I still tokenize and lower-case everything that way?

import os
import nltk
# create list of filenames to loop over
filenames = os.listdir('.')
# create empty lists for storage
Lcase_content = []
tokenized = []
num = 0
# read files from folder, convert to lower case 
for filename in filenames:  
    if filename.endswith(".txt"): 
        with open(os.path.join('.', filename)) as file: 
            content = file.read()   
            # convert to lower-case value 
            Lcase_content.append(content.lower())

            ## these two lines below don't work - index out of range error
            tokenized[num] = nltk.tokenize.word_tokenize(tokenized[num])
            num = num + 1

Upvotes: 0

Views: 2184

Answers (1)

Coding4pho

Reputation: 86

You can compute the count of each token by using the collections module. collections.Counter can take a list of strings and return a dictionary-like Counter object with each token as a key and its count as the value. Since NLTK's word_tokenize takes a string and returns a list of tokens, to get a dictionary with tokens and their counts, you can basically do this:

Counter(nltk.tokenize.word_tokenize(content))
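For instance, with a made-up string (not your actual file content), the combination returns:

from collections import Counter
import nltk

text = "cat dog cat squirrel"
print(Counter(nltk.tokenize.word_tokenize(text)))
# Counter({'cat': 2, 'dog': 1, 'squirrel': 1})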

Since you want your file names as the index (first column), make it a nested dictionary, with the file name as the key and another dictionary of tokens and counts as the value, which looks like this:

{'file1.txt': Counter({'cat': 4, 'dog': 0, 'squirrel': 12, 'sea horse': 3}), 'file2.txt': Counter({'cat': 11, 'dog': 4, 'squirrel': 17, 'sea horse': 0})}

If you are familiar with pandas, you can convert your dictionary to a pandas DataFrame. That makes it much easier to work with tsv/csv/excel files, since you can export the DataFrame result as a csv file. Make sure you apply .lower() to your file content and include orient='index' so that the file names become your index.

import os
import nltk
from collections import Counter
import pandas as pd

result = dict()

filenames = os.listdir('.')
for filename in filenames:
    if filename.endswith(".txt"):
        with open(os.path.join('.', filename)) as file:
            # lower-case the whole file, then count every token in it
            content = file.read().lower()
            result[filename] = Counter(nltk.tokenize.word_tokenize(content))

# one row per file name, one column per token; tokens missing from a file become 0
df = pd.DataFrame.from_dict(result, orient='index').fillna(0)

# add a running total of tokens per file
df['total words'] = df.sum(axis=1)

df.to_csv('words_count.csv', index=True)
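If you only need specific keywords rather than every token in the files, you can pick just those columns out of the resulting DataFrame. A small sketch, using an illustrative keyword list based on the example dictionary above:

keywords = ['cat', 'dog', 'squirrel']   # illustrative list; replace with your own keywords
keyword_df = df.reindex(columns=keywords, fill_value=0)
keyword_df.to_csv('keyword_count.csv', index=True)   # hypothetical output name

Note that a multi-word keyword like 'sea horse' gets split into two tokens by word_tokenize, so it would need separate handling.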

Re: your first attempt, since your items is a list (see [i.strip() for i in f.read().split(",")]), the values of file_dict are lists, and you can't apply .lower() to a list - only to the strings inside it.
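If you want to keep that dictionary-of-lists approach, a minimal fix is to lower-case each string inside each list rather than the list itself:

def lower_dict(d):
    # lower-case every string in every list value
    return {k: [item.lower() for item in v] for k, v in d.items()}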

Re: your second attempt, your tokenized list is empty because it was initialized as tokenized = []. That's why tokenized[num] = nltk.tokenize.word_tokenize(tokenized[num]) with num = 0 gives you the index out of range error - there is no element at index 0 to read from or assign to.
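A minimal fix for that loop (assuming you want tokenized to be a list of token lists, parallel to Lcase_content) is to append instead of assigning by index, and to tokenize the content you just read:

for filename in filenames:
    if filename.endswith(".txt"):
        with open(os.path.join('.', filename)) as file:
            content = file.read().lower()
            Lcase_content.append(content)
            # append a new entry instead of writing to an index that doesn't exist yet
            tokenized.append(nltk.tokenize.word_tokenize(content))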

Upvotes: 1
