Reputation: 70
I recently found this really handy library for PDF conversion. I am trying to convert a PDF to string values so that I can parse the data and write it to a CSV file. I want to automate this in the future, so I cannot use Tabula.
I am calling some modules to convert the PDF to a string.
The string conversion part (pdf2string.py) is not working.
Here is the part that converts the PDF to a string.
It runs without any error, but there is no output.
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import HTMLConverter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
import re
import csv
import sys

def convert_pdf_to_html(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = HTMLConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0 #is for all
    caching = True
    pagenos=set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    str = retstr.getvalue()
    retstr.close()
    return str
    print str

if __name__ == '__main__':
    if len(sys.argv) == 2:
        path = sys.argv[1]
        convert_pdf_to_html(path)
This is how I run it from bash:
python pdf2string.py example.pdf
The script is pdf2string.py and the path is example.pdf.
I am also new to high-level logic in Python.
Upvotes: 0
Views: 66
Reputation: 262
Edit: you are returning before printing. Remove return str, or remove print str and use the advice below.
You're not printing the output of convert_pdf_to_html(), or saving it somewhere.
print convert_pdf_to_html(path)
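To illustrate the point without needing pdfminer installed, here is a minimal sketch of the same pattern. The function convert_to_text() is a hypothetical stand-in for convert_pdf_to_html(), and the sample lines are made up. It is written in Python 3 syntax, so print is a function here, whereas the question's code is Python 2. The idea is only that the function returns the text and the caller decides whether to print or save it.

```python
import io


def convert_to_text(lines):
    """Hypothetical stand-in for convert_pdf_to_html(): build text in a buffer."""
    buffer = io.StringIO()
    for line in lines:
        buffer.write(line + "\n")
    text = buffer.getvalue()
    buffer.close()
    # Return the text. Anything placed after this line inside the
    # function would never run, which is why `print str` did nothing.
    return text


if __name__ == "__main__":
    # Calling the function alone discards the result and shows nothing.
    convert_to_text(["page 1", "page 2"])

    # Printing (or storing) the return value is what produces output.
    result = convert_to_text(["page 1", "page 2"])
    print(result)
```

Keeping the print outside the function also makes it easier to reuse the returned string later, for example when parsing it into CSV rows.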
Upvotes: 2