Reputation: 2626
I am writing a script for uploading PDF files and parsing them in the process. For the parsing I use PDFMiner.
To turn the file into a PDFMiner document, I use the following function, which closely follows the instructions at the link above:
    def load_document(self, _file = None):
        """turn the file into a PDFMiner document"""
        if _file == None:
            _file = self.options['file']
        parser = PDFParser(_file)
        doc = PDFDocument()
        doc.set_parser(parser)
        if self.options['password']:
            password = self.options['password']
        else:
            password = ""
        doc.initialize(password)
        if not doc.is_extractable:
            raise ValueError("PDF text extraction not allowed")
        return doc
The expected result is of course a nice PDFDocument instance, but instead I get an error:
    Traceback (most recent call last):
      File "bzk_pdf.py", line 45, in <module>
        cli.run_cli(BZKPDFScraper)
      File "/home/toon/Projects/amcat/amcat/scripts/tools/cli.py", line 61, in run_cli
        instance = cls(options)
      File "/home/toon/Projects/amcat/amcat/scraping/pdf.py", line 44, in __init__
        self.doc = self.load_document()
      File "/home/toon/Projects/amcat/amcat/scraping/pdf.py", line 56, in load_document
        doc.set_parser(parser)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfparser.py", line 327, in set_parser
        self.info.append(dict_value(trailer['Info']))
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdftypes.py", line 132, in dict_value
        x = resolve1(x)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdftypes.py", line 60, in resolve1
        x = x.resolve()
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdftypes.py", line 49, in resolve
        return self.doc.getobj(self.objid)
    AttributeError: 'NoneType' object has no attribute 'getobj'
I have no idea where to look, and I have not found anyone else with the same problem.
Some extra info that might help: _file is a Django File object, but using normal files gives the same result.
Upvotes: 0
Views: 3937
Reputation: 2626
After some experimenting I found that I was missing a line:
    parser.set_document(doc)
With that line added, the function now works.
This looks like poor library design to me, but it may be that I've missed something and this just patches over the error.
Anyhow, I now have a PDF document with the data I need.
Here's the end result:
    def load_document(self, _file=None):
        """turn the file into a PDFMiner document"""
        if _file is None:
            _file = self.options['file']
        parser = PDFParser(_file)
        doc = PDFDocument()
        # link parser and document in both directions
        parser.set_document(doc)
        doc.set_parser(parser)
        if 'password' in self.options:
            password = self.options['password']
        else:
            password = ""
        doc.initialize(password)
        if not doc.is_extractable:
            raise ValueError("PDF text extraction not allowed")
        return doc
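The traceback ends in `self.doc.getobj(...)` with `self.doc` being `None`, which suggests that object references created by the parser carry a pointer to the document, and that the pointer is only filled in once `parser.set_document(doc)` has run. The sketch below is a toy model of that wiring, written for Python 3. All of the `Toy*` classes and the `load` helper are hypothetical stand-ins, not PDFMiner's real classes; they only illustrate why the order of the two calls could matter.

```python
# Toy stand-ins, NOT the real PDFMiner classes. They only mimic the wiring
# that the traceback hints at: a reference object that remembers its document.

class ToyObjRef(object):
    def __init__(self, doc, objid):
        self.doc = doc        # stays None if the parser never got a document
        self.objid = objid

    def resolve(self):
        # Same shape as the failing frame: self.doc.getobj(self.objid)
        return self.doc.getobj(self.objid)


class ToyParser(object):
    def __init__(self):
        self.doc = None

    def set_document(self, doc):
        self.doc = doc

    def read_trailer(self):
        # References created here inherit whatever self.doc is right now.
        return {'Info': ToyObjRef(self.doc, 1)}


class ToyDocument(object):
    def __init__(self):
        self.objects = {1: {'Title': 'example'}}
        self.info = []

    def getobj(self, objid):
        return self.objects[objid]

    def set_parser(self, parser):
        # Reads the trailer and immediately resolves a reference from it.
        trailer = parser.read_trailer()
        self.info.append(trailer['Info'].resolve())


def load(link_parser_to_doc):
    """Build a toy document, optionally skipping parser.set_document()."""
    parser = ToyParser()
    doc = ToyDocument()
    if link_parser_to_doc:
        parser.set_document(doc)
    doc.set_parser(parser)
    return doc
```

In this model, `load(False)` fails with an `AttributeError` about `getobj` on `None`, while `load(True)` resolves the reference normally. Whether the real library behaves exactly like this is an assumption drawn from the traceback.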
Upvotes: 2
Reputation: 174624
Try opening the file and sending it to the parser, like this:
    with open(_file, 'rb') as f:
        parser = PDFParser(f)
        # your normal code here
The way you are doing it now, I suspect you are sending the filename as a string.
Upvotes: 0