user10083444
user10083444

Reputation: 105

How to read pdf files one by one from a folder in python

I am reading pdf files and trying to extract keywords from them through NLP techniques.Right now the program accepts one pdf at a time. I have a folder say in D drive named 'pdf_docs'. The folder contains many pdf documents. My goal is to read each pdf file one by one from the folder. How can I do that in python. The code so far working successfully is like below.

import PyPDF2

file = open('abc.pdf','rb')


fileReader = PyPDF2.PdfFileReader(file)

count = 0

while count < 3:

    pageObj = fileReader.getPage(count)
    count +=1
    text = pageObj.extractText()

Upvotes: 1

Views: 19818

Answers (3)

sudhagar narayanan
sudhagar narayanan

Reputation: 11

import PyPDF2
import re
import glob

#your full path of directory
mypath = "dir"
for file in glob.glob(mypath + "/*.pdf"):
    print(file)
    if file.endswith('.pdf'):
        fileReader = PyPDF2.PdfFileReader(open(file, "rb"))
        count = 0
        count = fileReader.numPages
        while count >= 0:
            count -= 1
            pageObj = fileReader.getPage(count)
            text = pageObj.extractText()
            print(text)
        num = re.findall(r'[0-9]+', text)
        print(num)
    else:
        print("not in format")

Let's go through the code: In python we can't handle Pdf files normally. so we need to install PyPDF2 package then import the package. "glob" function is used to read the files inside the directory. using "for" loop to get the files inside the folder. now check the file type is it in pdf format or not by using "if" condition. now we are reading the pdf files in the folder using "PdfFileReader"function. then getting number of pages in the pdf document. By using while loop to getting all pages and print all text in the file.

Upvotes: 1

olisch
olisch

Reputation: 990

you can use glob in order use pattern matching for getting a list of all pdf files in your directory.

import glob

pdf_dir = "/foo/dir"

pdf_files = glob.glob("%s/*.pdf" % pdf_dir)
for file in pdf_files:
    do_your_stuff()

Upvotes: 1

Raoslaw Szamszur
Raoslaw Szamszur

Reputation: 1740

First read all files that are available under that directory

from os import listdir
from os.path import isfile, join
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]

And then run your code for each file in that list

import PyPDF2
from os import listdir
from os.path import isfile, join


onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
for file in onlyfiles:
    fileReader = PyPDF2.PdfFileReader(open(file,'rb'))

    count = 0

    while count < 3:

        pageObj = fileReader.getPage(count)
        count +=1
        text = pageObj.extractText()

os.listdir() will get you everything that's in a directory - files and directories. So be careful to have only pdf files in your path or you will need to implement simple filtration for list.

Edit 1

You can also use glob module, as it does pattern matching.

>>> import glob
>>> print(glob.glob('/home/rszamszur/*.sh'))
['/home/rszamszur/work-monitors.sh', '/home/rszamszur/default-monitor.sh', '/home/rszamszur/home-monitors.sh']

Key difference between OS module and glob is that OS will work for all systems, where glob only for Unix like.

Upvotes: 1

Related Questions