Nabil Jaroush

Reputation: 49

How to convert multiple files from PDF to text files using Python

I have a Python script that converts a PDF file to a text file. The script asks the user for the path of the folder that contains the PDF files.

The problem is that the script converts only one file. What I need is for the script to convert all the PDF files in the specified directory.

The script lists all the files in the specified directory, but it converts all of them except the last one.

Result after incrementing i: [image]

Code:

import os
from os import chdir, getcwd, listdir, path
import codecs
import pyPdf
from time import strftime

def check_path(prompt):
    ''' (str) -> str
    Verifies if the provided absolute path does exist.
    '''
    abs_path = raw_input(prompt)
    while path.exists(abs_path) != True:
        print "\nThe specified path does not exist.\n"
        abs_path = raw_input(prompt)
    return abs_path    

print "\n"

folder = check_path("Provide absolute path for the folder: ")

list=[]
directory=folder
for root,dirs,files in os.walk(directory):
    for filename in files:
        if filename.endswith('.pdf'):
            t=os.path.join(directory,filename)
            list.append(t)

m=len(list)
i=0
while i<=len(list):

    path=list[i]
    head,tail=os.path.split(path)
    var="\\"

    tail=tail.replace(".pdf",".txt")
    name=head+var+tail



    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for j in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(j).extractText() + "\n"
    print strftime("%H:%M:%S"), " pdf  -> txt "
    f=open(name,'w')
    f.write(content.encode('UTF-8'))
    f.close
    i+=1

Upvotes: 0

Views: 3655

Answers (3)

WCTech

Reputation: 174

You created a while loop, but it will run forever because you do not update the value of i after each iteration.

Just put i+=1 at the bottom of your while loop and change your for loop to:

for x in range(0, pdf.getNumPages()):
    # Extract text from page and add to content
    content += pdf.getPage(x).extractText() + "\n"

Otherwise the i of the for loop interferes with the i of the while loop.
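To illustrate the interference: in Python a for loop variable is not scoped to the loop, so reusing i inside the while body overwrites the outer counter. Here is a minimal sketch; the page_counts list and the visited_indices helper are made up for illustration and stand in for pdf.getNumPages() on three files.

```python
# Hypothetical page counts standing in for pdf.getNumPages() of three files.
page_counts = [3, 1, 2]

def visited_indices(reuse_name):
    """Return the outer indices visited by an index-based while loop."""
    visited = []
    i = 0
    while i < len(page_counts):
        visited.append(i)
        if reuse_name:
            # Same name as the outer counter: the for loop overwrites it.
            for i in range(page_counts[i]):
                pass
        else:
            # Separate name: the outer counter is left untouched.
            for j in range(page_counts[i]):
                pass
        i += 1
    return visited

print(visited_indices(True))   # [0]
print(visited_indices(False))  # [0, 1, 2]
```

With the shared name, the inner loop leaves i at 2 after the first file, so the remaining files are skipped. With other page counts the same clash can instead make the loop repeat files.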

Upvotes: 0

Vineet Bhat

Reputation: 148

Apart from not incrementing the variable i in the while loop, you are also using the same variable name i in the for loop. So after leaving the for loop, the value of i has already changed. You should use separate variable names in the while and for loops.
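One way to sidestep the counter entirely is to iterate over the collected paths directly. The sketch below assumes Python 3; find_pdfs, txt_path_for, and convert_all are hypothetical helper names, and extract_text is a caller-supplied function standing in for the pyPdf page loop from the question.

```python
import os

def find_pdfs(folder):
    """Collect the full path of every .pdf file under folder."""
    pdf_paths = []
    for root, _dirs, files in os.walk(folder):
        for filename in files:
            if filename.lower().endswith(".pdf"):
                # Join with root (not folder) so files in subfolders resolve.
                pdf_paths.append(os.path.join(root, filename))
    return sorted(pdf_paths)

def txt_path_for(pdf_path):
    """Return the output path: same location, .txt extension."""
    base, _ext = os.path.splitext(pdf_path)
    return base + ".txt"

def convert_all(folder, extract_text):
    """Write one .txt per PDF; extract_text(path) -> str is supplied by the caller."""
    written = []
    for pdf_path in find_pdfs(folder):  # no index, so no off-by-one
        out_path = txt_path_for(pdf_path)
        # The with block closes the file even if writing fails.
        with open(out_path, "w", encoding="utf-8") as out:
            out.write(extract_text(pdf_path))
        written.append(out_path)
    return written
```

Because there is no index, a condition such as i<=len(list) cannot run past the end of the list, and os.path.splitext avoids hard-coding the "\\" separator.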

Upvotes: 1

bigbounty

Reputation: 17368

You forgot to increment the variable i.

There is a simpler way of doing this in Python.

Download and install PDFMiner.

Then use the subprocess module to do the job.

import os
import subprocess

files = [
    'file1.pdf', 'file2.pdf', 'file3.pdf'
]
for f in files:
    # splitext keeps dots inside the file name, unlike f.split('.')[0]
    cmd = 'python pdf2txt.py -o "%s.txt" "%s"' % (os.path.splitext(f)[0], f)
    run = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = run.communicate()
    # display errors for this file if they occur
    if err:
        print(err)
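To run this over a whole folder instead of a hard-coded list, the file names can be read from the directory. This is a rough sketch assuming Python 3.7+ and that pdf2txt.py sits in the working directory, as in the command above; build_command and convert_folder are hypothetical helper names. Passing the command as a list avoids shell quoting problems with spaces in file names.

```python
import os
import subprocess
import sys

def build_command(pdf_path, script="pdf2txt.py"):
    """Build the pdf2txt.py argument list for one PDF."""
    out_path = os.path.splitext(pdf_path)[0] + ".txt"
    return [sys.executable, script, "-o", out_path, pdf_path]

def convert_folder(folder, script="pdf2txt.py"):
    """Convert every PDF directly inside folder; return {pdf_path: stderr} for failures."""
    failures = {}
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".pdf"):
            continue
        pdf_path = os.path.join(folder, name)
        # No shell involved: each argument is passed to the script as-is.
        result = subprocess.run(
            build_command(pdf_path, script),
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            failures[pdf_path] = result.stderr
    return failures
```

Calling convert_folder with the folder path entered by the user returns a dictionary of the files that failed, so one bad PDF does not hide the errors of the others.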

Upvotes: 1
