Reputation: 51
I have a large collection of text documents that I want to loop through, counting occurrences of particular words in each and outputting the results in a simple dataframe of article title and word count. However, my output dataframe is clearly incorrect. I suspect I'm doing something silly in the code. Would anyone be able to help identify the problem?
I collect the articles using the glob package and then loop through them, counting with the count function. However, my output gives patently wrong answers such as counts of '1' or '0' for occurrences of simple things like the word "we" in very large documents.
import glob
import pandas as pd

articles = glob.glob('Test/*.txt')
we_dict = {}
for article in articles:
    we_dict[article] = article.count("we")

we = pd.DataFrame.from_dict(we_dict, orient='index', dtype=None)
There are no error messages, so the code is doing something - a dataframe is produced. But the count values output should be in the hundreds rather than small numbers such as 0, 1, or 2.
EDIT:
Working version for future readers with the same query, thanks to the very helpful responses. I'm sure the code could be simplified somewhat (one possible shortening is sketched after the code).
import glob
import re
import pandas as pd

articles = glob.glob('Test/*.txt')
we_dict = {}
for article in articles:
    with open(article, 'r', encoding="utf8") as art:
        a = art.read()
    a = a.lower()
    we_dict[article] = sum(1 for match in re.finditer(r"\bwe\b", a))

we = pd.DataFrame.from_dict(we_dict, orient='index', dtype=None)
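On the simplification point: a slightly shorter sketch of the same idea, assuming the same Test/*.txt layout, is to let re.IGNORECASE handle the lowercasing and len(re.findall(...)) do the counting. The we_count column label is just an example name, and the columns= argument to from_dict needs a reasonably recent pandas.

import glob
import re
import pandas as pd

we_dict = {}
for article in glob.glob('Test/*.txt'):
    with open(article, 'r', encoding='utf8') as art:
        text = art.read()
    # IGNORECASE replaces the manual lower(); len() of findall replaces the sum() over finditer
    we_dict[article] = len(re.findall(r"\bwe\b", text, flags=re.IGNORECASE))

we = pd.DataFrame.from_dict(we_dict, orient='index', columns=['we_count'])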
Upvotes: 1
Views: 55
Reputation: 2349
Right now, your code is iterating through your list of articles with article bound to each filename. The line we_dict[article] = article.count("we") is actually taking your filename and counting the substring 'we' in the name itself! So what you'll need to do is open the file using that filename and then read its contents.
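You can see the effect directly in an interpreter; the filename below is made up purely for illustration:

filename = 'Test/lower_house_speeches.txt'   # hypothetical filename for illustration
filename.count("we")                         # 1 -- the 'we' inside 'lower', nothing to do with the file's contents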
One possible way to approach this is to loop over the files, storing the count of your word for each file in a dictionary, and then building the dataframe from that dict. Maybe like this:
import glob
import pandas as pd

articles = glob.glob('*.txt')
txt_files = {}
word = 'cupcakes'

for article in articles:
    with open(article, 'r') as file:
        # split() gives whitespace-separated tokens, so this counts whole-word occurrences
        txt_files[article] = file.read().split().count(word)

my_word = pd.DataFrame.from_dict(txt_files, orient='index')
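One caveat with counting whitespace-split tokens this way: punctuation attached to a word stops it matching, which the \bword\b regex in the question's edit handles. A quick illustration with made-up text:

import re

text = "we made cupcakes and more cupcakes!"
text.split().count('cupcakes')            # 1 -- 'cupcakes!' is not an exact token match
len(re.findall(r"\bcupcakes\b", text))    # 2 -- \b treats the '!' as a word boundary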
Upvotes: 1