Reputation: 1
y'all. I've been trying to remove stopwords from a list of text extracted from a PDF, but whenever I use NLTK to remove those stopwords (either from the list itself or into a new list), the TXT file I write out still contains the original list unchanged. I made a separate program just to test whether the stopwords filtering even works, and it works fine there, but for some reason not in this case.
Is there also a better method to do this? Any help would be much appreciated.
import PyPDF2 as pdf
import nltk
from nltk.corpus import stopwords
stopping_words = set(stopwords.words('english'))
stop_words = list(stopping_words)
# creating an object
file = open("C:\\Users\\Name\\Documents\\Data Analytics Club\\SampleBook-English2-Reading.pdf", "rb")
# creating a pdf reader object
fileReader = pdf.PdfFileReader(file)
# print the number of pages in pdf file
textData = []
for pages in fileReader.pages:
    theText = pages.extractText()
    # for char in theText:
    #     theText.replace(char, "\n")
    textData.append(theText)
final_list = []
for i in textData:
    if i in stopwords.words('english'):
        textData.remove(i)
    final_list.append(i.strip('\n'))
# filtered_word_list = final_list[:]  # make a copy of the word_list
# for word in final_list:  # iterate over word_list
#     if word in stopwords.words('english'):
#         final_list.remove(word)  # remove word from filtered_word_list if it is a stopword
# filtered_words = [word for word in final_list if word not in stop_words]
# [s.strip('\n') for s in theText]
# [s.replace('\n', '') for s in theText]
# text_data = []
# for elem in textData:
#     text_data.extend(elem.strip().split('n'))
# for line in textData:
#     textData.append(line.strip().split('\n'))
#--------------------------------------------------------------------
import os.path
save_path = "C:\\Users\\Name\\Documents\\Data Analytics Club"
name_of_file = input("What is the name of the file: ")
completeName = os.path.join(save_path, name_of_file + ".txt")
file1 = open(completeName, "w")
# file1.write(str(final_list))
for line in final_list:
    file1.write(line)
file1.close()
Upvotes: 0
Views: 568
Reputation: 1314
The problem is in this line:

if i in stopwords.words('english'):
    textData.remove(i)

You are only removing a single occurrence of that word: list.remove() deletes just the first matching element of the list.
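A quick self-contained demonstration of that behaviour (plain Python, no NLTK needed):

```python
# list.remove() deletes only the first matching element,
# so duplicates later in the list survive.
words = ["the", "cat", "the", "mat"]
words.remove("the")
print(words)  # ['cat', 'the', 'mat'] -- the second "the" is still there
```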
What you probably want to do instead is filter the list and assign the result back (filter returns a new sequence rather than modifying textData in place):

Python 2

textData = filter(lambda x: x != i, textData)

Python 3

textData = list(filter(lambda x: x != i, textData))
EDIT
So I realized quite a bit late that you are actually iterating over the list that you are removing elements from. You would not want to do that: removing elements from a list while iterating over it makes the iterator skip items.
Instead, what you would want to do is:
for i in set(textData):
    if i in stopwords.words('english'):
        pass
    else:
        final_list.append(i.strip('\n'))
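A runnable sketch of that loop, with a tiny hardcoded stopword set standing in for stopwords.words('english') so it works without NLTK (the items here carry '\n' like the extracted lines, so the membership check strips it first):

```python
# Tiny stand-in for NLTK's English stopword list.
stop_words = {"the", "a", "is", "of"}

textData = ["the\n", "quick\n", "fox\n", "the\n", "a\n", "lazy\n"]

final_list = []
# Iterate over a separate collection (a set of the unique items),
# so we never mutate the list we are looping over.
for i in set(textData):
    if i.strip('\n') in stop_words:
        pass
    else:
        final_list.append(i.strip('\n'))

print(sorted(final_list))  # ['fox', 'lazy', 'quick']
```

Note that set() also deduplicates the non-stopwords, which is fine if you only care about the unique content words but loses word counts otherwise.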
EDIT 2
So apparently the issue comes from the extraction loop, which needs to split each page's text into lines before collecting it:

for pages in fileReader.pages:
    theText = pages.extractText()
    words = theText.splitlines()
    textData.extend(words)
However, for the file I tested this against, it still gave issues with spacing and merged words in the same sentence, such as 'sameuserwithinacertaintimeinterval(typicallysettoa' and 'bedirectionaltocapturethefactthatonestorywasclicked'.
That being said, the issue lies within PyPDF2's text extraction itself, so you may wish to resort to another reader. Comment if it still doesn't help.
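Whichever reader you end up with, the extracted page text still has to be broken into individual words before the stopword check: in the original loop each item of textData was a whole page (or a whole line after splitlines()), which never equals a single stopword. A minimal sketch with a hardcoded sample string and stopword set, so it runs without a PDF or NLTK:

```python
stop_words = {"the", "a", "is", "of"}  # stand-in for stopwords.words('english')

# Pretend this string came from pages.extractText().
theText = "the quick fox\nis on a mat"

# str.split() with no argument breaks on any whitespace
# (spaces and newlines alike), yielding actual words.
words = theText.lower().split()
filtered = [w for w in words if w not in stop_words]
print(filtered)  # ['quick', 'fox', 'on', 'mat']
```

This still can't recover words that the reader merged together in the first place, but it makes the stopword comparison operate on words rather than whole lines.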
Upvotes: 1