Reputation: 105
I am reading PDF files and trying to extract keywords from them with NLP techniques. Right now the program accepts one PDF at a time. I have a folder on my D drive, say 'pdf_docs', that contains many PDF documents. My goal is to read each PDF file from that folder, one by one. How can I do that in Python? The code that works so far is below.
import PyPDF2
file = open('abc.pdf','rb')
fileReader = PyPDF2.PdfFileReader(file)
count = 0
while count < 3:
    pageObj = fileReader.getPage(count)
    count += 1
    text = pageObj.extractText()
Upvotes: 1
Views: 19818
Reputation: 11
import PyPDF2
import re
import glob

# full path of the directory that holds the PDF files
mypath = "dir"
for file in glob.glob(mypath + "/*.pdf"):
    print(file)
    if file.endswith('.pdf'):
        fileReader = PyPDF2.PdfFileReader(open(file, "rb"))
        count = fileReader.numPages
        # walk the pages from the last one back to the first;
        # stopping at count > 0 avoids calling getPage(-1)
        while count > 0:
            count -= 1
            pageObj = fileReader.getPage(count)
            text = pageObj.extractText()
            print(text)
            num = re.findall(r'[0-9]+', text)
            print(num)
    else:
        print("not in format")
Let's go through the code. Python cannot read PDF files out of the box, so install the PyPDF2 package and import it. glob.glob() returns the paths in the directory that match the pattern, and the "for" loop visits each of them. The "if" condition checks that the file really has a .pdf extension. PdfFileReader opens the file, and numPages gives the number of pages in the document. The while loop then walks through the pages and prints the text of each one.
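The page loop described above can also be written as a plain for loop, which makes it harder to step past the first or last page. This is only a sketch, assuming the legacy PyPDF2 interface used in this thread (numPages, getPage, extractText); the helper names extract_pages and find_numbers are made up for illustration and are not part of PyPDF2.

```python
import re


def extract_pages(reader, max_pages=None):
    """Return the text of each page, in order.

    `reader` is assumed to expose the legacy PyPDF2 interface used in this
    thread: a `numPages` attribute and `getPage(i).extractText()`.
    """
    total = reader.numPages
    if max_pages is not None:
        # never ask for a page that does not exist
        total = min(total, max_pages)
    return [reader.getPage(i).extractText() for i in range(total)]


def find_numbers(text):
    """Collect every run of digits, using the same pattern as the answer."""
    return re.findall(r'[0-9]+', text)
```

Passing max_pages=3 gives the "first three pages" behaviour from the question without failing on shorter documents.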
Upvotes: 1
Reputation: 990
You can use glob, which does pattern matching, to get a list of all the PDF files in your directory.
import glob
pdf_dir = "/foo/dir"
pdf_files = glob.glob("%s/*.pdf" % pdf_dir)
for file in pdf_files:
    do_your_stuff()
Upvotes: 1
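If it helps, here is the glob approach wrapped in a small function. It is a sketch rather than a definitive version: find_pdfs is a hypothetical name, and the recursive option is an extra that the answer does not mention. It builds the pattern with os.path.join so the separator is right on both Windows and Unix.

```python
import glob
import os


def find_pdfs(pdf_dir, recursive=False):
    """Return the sorted paths of the .pdf files under pdf_dir."""
    if recursive:
        # "**" matches this folder and every subfolder when recursive=True
        pattern = os.path.join(pdf_dir, "**", "*.pdf")
    else:
        pattern = os.path.join(pdf_dir, "*.pdf")
    return sorted(glob.glob(pattern, recursive=recursive))
```

For the folder in the question, the call would be something like find_pdfs("D:/pdf_docs"), and each returned path can be passed straight to open(path, "rb").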
Reputation: 1740
First, list all the files in that directory:
from os import listdir
from os.path import isfile, join
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
Then run your code for each file in that list:
import PyPDF2
from os import listdir
from os.path import isfile, join
# mypath is the directory that contains the PDF files
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
for file in onlyfiles:
    # listdir() returns bare names, so join them with the directory before opening
    fileReader = PyPDF2.PdfFileReader(open(join(mypath, file), 'rb'))
    count = 0
    while count < 3:
        pageObj = fileReader.getPage(count)
        count += 1
        text = pageObj.extractText()
os.listdir() returns everything in a directory, both files and subdirectories. So make sure the path contains only PDF files, or add a simple filter to the list.
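A simple filter of that kind might look like the sketch below. The function name list_pdf_paths is made up for illustration. It keeps only regular files whose name ends in .pdf (in any letter case) and returns full paths, so they can be opened no matter what the current working directory is.

```python
from os import listdir
from os.path import isfile, join


def list_pdf_paths(mypath):
    """Return full paths of the regular files in mypath that end in .pdf."""
    return [
        join(mypath, f)
        for f in sorted(listdir(mypath))
        # skip subdirectories and anything without a .pdf extension
        if isfile(join(mypath, f)) and f.lower().endswith(".pdf")
    ]
```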
You can also use the glob module, as it does pattern matching.
>>> import glob
>>> print(glob.glob('/home/rszamszur/*.sh'))
['/home/rszamszur/work-monitors.sh', '/home/rszamszur/default-monitor.sh', '/home/rszamszur/home-monitors.sh']
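The key difference between the two is in what they return: os.listdir() gives the bare names of everything in the directory, while glob returns only the paths that match a pattern. Both work on all platforms, Windows included; glob simply uses Unix shell-style wildcards for its patterns.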
Upvotes: 1