Reputation: 49
I have a Python script that converts PDF files to text files. The script asks the user for the path of the folder that contains the PDF files.
The problem is that the script converts only one file; what I need is for it to convert all the PDF files in the specified directory.
The script lists all the files in the specified directory, but it converts all of them except the last one.
Result after incrementing i:
import os
from os import chdir, getcwd, listdir, path
import codecs
import pyPdf
from time import strftime

def check_path(prompt):
    ''' (str) -> str
    Verifies if the provided absolute path does exist.
    '''
    abs_path = raw_input(prompt)
    while path.exists(abs_path) != True:
        print "\nThe specified path does not exist.\n"
        abs_path = raw_input(prompt)
    return abs_path

print "\n"
folder = check_path("Provide absolute path for the folder: ")
list=[]
directory=folder
for root,dirs,files in os.walk(directory):
    for filename in files:
        if filename.endswith('.pdf'):
            t=os.path.join(directory,filename)
            list.append(t)
m=len(list)
i=0
while i<=len(list):
    path=list[i]
    head,tail=os.path.split(path)
    var="\\"
    tail=tail.replace(".pdf",".txt")
    name=head+var+tail
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for j in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(j).extractText() + "\n"
    print strftime("%H:%M:%S"), " pdf -> txt "
    f=open(name,'w')
    f.write(content.encode('UTF-8'))
    f.close
    i+=1
Upvotes: 0
Views: 3655
Reputation: 174
You created a while loop, but it will run forever because you never update the value of i inside the loop.
Just put
i+=1
at the bottom of your while loop,
and change your for loop to
for x in range(0, pdf.getNumPages()):
    # Extract text from page and add to content
    content += pdf.getPage(x).extractText() + "\n"
The i of the for loop is interfering with the i of the while loop.
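
As a possible alternative to keeping an index by hand, the list can be iterated directly with a for loop, so there is no counter to forget or overwrite. The following is only a sketch, written for Python 3 (the question uses Python 2). The names find_pdfs, txt_path_for and convert_all are hypothetical, and extract_text stands in for whatever PDF library does the actual extraction (pyPdf in the question).

```python
import os


def find_pdfs(directory):
    """Return the full paths of all .pdf files under directory, including subfolders."""
    found = []
    for root, _dirs, files in os.walk(directory):
        for filename in files:
            if filename.lower().endswith(".pdf"):
                # Join with root (the folder currently being walked), not the
                # top-level directory, so files in subfolders resolve correctly.
                found.append(os.path.join(root, filename))
    return sorted(found)


def txt_path_for(pdf_path):
    """Swap the extension for .txt without building the path by hand."""
    base, _ext = os.path.splitext(pdf_path)
    return base + ".txt"


def convert_all(directory, extract_text):
    """Write one .txt next to each PDF.

    extract_text is an assumed callable (pdf path -> text); plug in the
    page-by-page extraction from the question here.
    """
    written = []
    for pdf_path in find_pdfs(directory):  # no manual index to increment
        text = extract_text(pdf_path)
        out_path = txt_path_for(pdf_path)
        # The with block closes the file even if writing fails.
        with open(out_path, "w", encoding="utf-8") as out:
            out.write(text)
        written.append(out_path)
    return written
```

Iterating over the list itself also removes the off-by-one risk of a condition such as i<=len(list), which steps one past the last valid index.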
Upvotes: 0
Reputation: 148
Apart from not incrementing the variable i in the while loop, you are also using the same variable name i in the for loop. So, after leaving the for loop, the value of the variable i has already changed. You should use separate variable names in the while and for loops.
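
To illustrate the clash, here is a small sketch (Python 3, no PDF library involved) that only counts how many files the outer loop would process. The function names and the cap parameter are made up for this example; the cap is just a safety stop, because sharing the counter can also make the loop run forever.

```python
def files_processed_shared_counter(n_files, n_pages, cap=1000):
    """Outer while loop and inner for loop both use i (the problematic version)."""
    processed = 0
    i = 0
    while i < n_files and processed < cap:
        for i in range(n_pages):  # overwrites the outer counter
            pass
        processed += 1
        i += 1
    return processed


def files_processed_separate_counters(n_files, n_pages):
    """Same loops, but the inner loop uses its own variable j."""
    processed = 0
    i = 0
    while i < n_files:
        for j in range(n_pages):  # leaves i untouched
            pass
        processed += 1
        i += 1
    return processed
```

With 3 files of 10 pages each, files_processed_shared_counter(3, 10) returns 1, because i is already 10 after the first file, while files_processed_separate_counters(3, 10) returns 3. That would be consistent with a script that converts only one file.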
Upvotes: 1
Reputation: 17368
You missed incrementing the variable i.
There is a simpler way of doing this in Python.
Download and install PDFMiner.
Then use the subprocess module to run its pdf2txt.py script on each file.
import subprocess

files = [
    'file1.pdf', 'file2.pdf', 'file3.pdf'
]
for f in files:
    cmd = 'python pdf2txt.py -o %s.txt %s' % (f.split('.')[0], f)
    run = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = run.communicate()
    # display errors if they occur
    if err:
        print err
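
A variation on the same idea, as a Python 3 sketch: passing the command as a list of arguments avoids shell=True, so paths containing spaces need no quoting, and os.path.splitext keeps file names with extra dots intact. The helper names below are hypothetical, and the code assumes, as above, that pdf2txt.py sits in the current working directory.

```python
import os
import subprocess
import sys


def build_pdf2txt_command(pdf_path, script="pdf2txt.py"):
    """Build the argument list for one conversion.

    script is assumed to be the path to PDFMiner's pdf2txt.py; -o names the output file.
    """
    base, _ext = os.path.splitext(pdf_path)
    return [sys.executable, script, "-o", base + ".txt", pdf_path]


def convert_with_pdf2txt(pdf_paths, script="pdf2txt.py"):
    """Run the script once per PDF and collect stderr for any run that failed."""
    errors = {}
    for pdf_path in pdf_paths:
        result = subprocess.run(
            build_pdf2txt_command(pdf_path, script),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        if result.returncode != 0:
            errors[pdf_path] = result.stderr.decode("utf-8", "replace")
    return errors
```

convert_with_pdf2txt returns a dictionary mapping each failed path to its error output, so an empty dictionary means every conversion exited cleanly.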
Upvotes: 1