Reputation: 51
I have a large collection of text documents that I want to loop through, counting occurrences of particular words in each and outputting the results in a simple dataframe of article title and word count. However, my output dataframe is clearly incorrect. I suspect I'm doing something silly in the code. Would anyone be able to help identify the problem?
I collect the articles using the glob package and then loop through them, counting with the count function. However, my output gives patently wrong answers such as counts of '1' or '0' for occurrences of simple things like the word "we" in very large documents.
import glob
import pandas as pd

articles = glob.glob('Test/*.txt')
we_dict = {}
for article in articles:
    we_dict[article] = article.count("we")

we = pd.DataFrame.from_dict(we_dict, orient='index', dtype=None)
There are no error messages, so the code is doing something - a dataframe is produced. But the count values output should be in the hundreds rather than small numbers such as 0, 1, or 2.
EDIT:
Working version for future readers with the same query, thanks to the very helpful responses. I'm sure the code could be simplified somewhat (one possible shortening is sketched after the code).
import glob
import re
import pandas as pd

articles = glob.glob('Test/*.txt')
we_dict = {}
for article in articles:
    with open(article, 'r', encoding="utf8") as art:
        a = art.read()
    a = a.lower()
    we_dict[article] = sum(1 for match in re.finditer(r"\bwe\b", a))

we = pd.DataFrame.from_dict(we_dict, orient='index', dtype=None)
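On the simplification point: a slightly shorter sketch of the same idea, assuming the same Test/*.txt layout, is to let re.IGNORECASE handle the lowercasing and len(re.findall(...)) do the counting. The we_count column label is just an example name, and the columns= argument to from_dict needs a reasonably recent pandas.

import glob
import re
import pandas as pd

we_dict = {}
for article in glob.glob('Test/*.txt'):
    with open(article, 'r', encoding='utf8') as art:
        text = art.read()
    # IGNORECASE replaces the manual lower(); len() of findall replaces the sum() over finditer
    we_dict[article] = len(re.findall(r"\bwe\b", text, flags=re.IGNORECASE))

we = pd.DataFrame.from_dict(we_dict, orient='index', columns=['we_count'])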
Upvotes: 1
Views: 55
Reputation: 2349
Right now, your code is iterating through your list of articles with article bound to each filename. The line we_dict[article] = article.count("we") is actually taking your filename and counting the substring 'we' in the name itself! So what you'll need to do is open the file using that filename and then read its contents.
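You can see the effect directly in an interpreter; the filename below is made up purely for illustration:

filename = 'Test/lower_house_speeches.txt'   # hypothetical filename for illustration
filename.count("we")                         # 1 -- the 'we' inside 'lower', nothing to do with the file's contents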
One possible way to approach this is to loop over the files, storing the count of your word for each file in a dictionary, and then building the dataframe from that dict. Maybe like this:
import glob
import pandas as pd

articles = glob.glob('*.txt')
txt_files = {}
word = 'cupcakes'

for article in articles:
    with open(article, 'r') as file:
        # split() gives whitespace-separated tokens, so this counts whole-word occurrences
        txt_files[article] = file.read().split().count(word)

my_word = pd.DataFrame.from_dict(txt_files, orient='index')
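One caveat with counting whitespace-split tokens this way: punctuation attached to a word stops it matching, which the \bword\b regex in the question's edit handles. A quick illustration with made-up text:

import re

text = "we made cupcakes and more cupcakes!"
text.split().count('cupcakes')            # 1 -- 'cupcakes!' is not an exact token match
len(re.findall(r"\bcupcakes\b", text))    # 2 -- \b treats the '!' as a word boundary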
Upvotes: 1