Reputation: 49
My goal is to print the number of times a word occurs in each file in a folder. The problem is that my code counts a line as 1 occurrence even if the word appears in that line more than once.
Example: for the line "like like like like"
the output is 1, not 4.
import os
import math
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stopwords= set(stopwords.words('english'))
folderpath = "C:\\Users\\user\\Desktop\\Documents"
word = input("Choose a word : ")
for(path, dirs, files) in os.walk(folderpath, topdown=True):
    for file in files:
        counter = 0
        idf = 0
        filepath = os.path.join(path, file)
        with open(filepath, 'r') as f:
            info = f.readlines()
            for line in f:
                if word in str(info).casefold() and word not in stopwords:
                    for line in info:
                        if word in line:
                            counter=counter+1
                            idf = 1 + math.log10(counter)
                            weight = idf * counter
                    print("The tf in" + " " + os.path.splitext(file)[0] + " "+ "is :" + " " + " " + str(counter))
                    print ("The idf is" + ":" + " "+ str(idf))
                    print("The weight is"+":" + " " + str(weight))
                    print(" ")
The results are:
the document's name and the term frequency,
then the inverse document frequency,
then the weight.
I expected the same output, except for the term frequency: the counter should be the total number of times the word occurs in the file. Instead it is the number of lines that contain the word, because 1 is added to the counter for each such line regardless of how many times the word occurs in it.
Upvotes: 0
Views: 270
Reputation: 154
I think you are having issues because of:
if word in str(info).casefold() and word not in stopwords:
    for line in info:
        if word in line:
            counter=counter+1
            idf = 1 + math.log10(counter)
This only adds 1 to your "counter" for each line that contains a match, no matter how many times the word appears in that line.
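To make the difference concrete, here is a minimal sketch using the sample line from the question (the variable names are just for illustration). It compares the yes/no membership test with an actual occurrence count:

```python
line = "like like like like"
word = "like"

# "word in line" is only True or False, so it can add at most 1 per line.
per_line_hits = 1 if word in line else 0

# str.count returns the number of non-overlapping occurrences instead.
occurrences = line.count(word)

print(per_line_hits, occurrences)  # prints: 1 4
```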
I think you would be much better off using re.findall on each line and adding the number of matches it returns to your "counter".
Please see my code below. It is not a full solution, but you should be able to see how it could be inserted into your code.
import re
Mylist = ("like like like like like like", "right ike left like herp derp")  # This is in place of your files.
word = "like"  # word to look for
counter = 0
for line in Mylist:  # in your code this would be "for line in f:"
    search = re.findall(word, line)  # use re.findall to search for all instances of your word in the given line.
    for match in search:  # then add 1 to your counter for every match re.findall returned for that line.
        counter = counter + 1
print(counter)
This code prints:
7
There is a further optimisation: because re.findall searches a whole string, you do not need to read your file line by line. You can search the whole file at once, like this:
with open(filepath, 'r') as f:
    info = f.read()
search = re.findall(word, info)
for i in search:
    counter = counter + 1
This should give the same count with one less level of nesting in your loop.
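One caveat: re.findall(word, info) matches substrings as well, so "like" would also be counted inside "likely", and characters such as "." in the word are read as regex syntax. If you want whole-word, case-insensitive counts, the following is a rough sketch. That requirement is my assumption about what you need, and count_word is a made-up helper name, not something from your code:

```python
import re

def count_word(text, word):
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    # re.escape stops characters such as "." or "+" in the word from being
    # treated as regex syntax; \b restricts matches to whole words.
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

sample = "Like like, likely unlike LIKE"
print(count_word(sample, "like"))  # prints: 3
```

Inside your file loop you could then set counter = count_word(f.read(), word) instead of incrementing it in a loop.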
Upvotes: 3