Reputation: 61
I am trying to extract the coordinates of each word in an input PDF file using pdfminer. I tried the code below:
    from pdfminer.layout import LAParams, LTTextBox, LTText, LTChar, LTAnno
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.converter import PDFPageAggregator

    fp = open('Input.pdf', 'rb')
    manager = PDFResourceManager()
    laparams = LAParams()
    dev = PDFPageAggregator(manager, laparams=laparams)
    interpreter = PDFPageInterpreter(manager, dev)
    pages = PDFPage.get_pages(fp)

    for page in pages:
        interpreter.process_page(page)
        layout = dev.get_result()
        x, y, text = -1, -1, ''
        for textbox in layout:
            if isinstance(textbox, LTText):
                for line in textbox:
                    for char in line:
                        # If the char is a line-break or an empty space, the word is complete
                        if isinstance(char, LTAnno) or char.get_text() == ' ':
                            if x != -1:
                                print('%r : %s' % ((x, y), text))
                            x, y, text = -1, -1, ''
                        elif isinstance(char, LTChar):
                            text += char.get_text()
                            if x == -1:
                                x, y, = char.bbox[0], char.bbox[3]
        # If the last symbol in the PDF was neither an empty space nor a LTAnno, print the word here
        if x != -1:
            print('At %r : %s' % ((x, y), text))
This extracts the coordinates of the words on the first page of the input file. After that, the code fails with the following error:
    TypeError                                 Traceback (most recent call last)
    <ipython-input-154-a00e7d332dc4> in <module>
         19         if isinstance(textbox, LTText):
         20             for line in textbox:
    ---> 21                 for char in line:
         22                     # If the char is a line-break or an empty space, the word is complete
         23                     if isinstance(char, LTAnno) or char.get_text() == ' ':

    TypeError: 'LTChar' object is not iterable
My question is:
Upvotes: 2
Views: 727
Reputation: 140
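As Zach Young commented, I would make sure that line (the variable iterated on line 21 of the traceback) is not an LTChar object before looping over it. LTText is a base class shared by text boxes, text lines and LTChar itself, so the isinstance(textbox, LTText) test lets through objects whose children are single characters. A check such as this one guards the inner loop:

    if isinstance(line, LTTextLineHorizontal):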
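An alternative to checking the type at each nesting level is to descend only into objects that can actually be iterated. The sketch below is a generic helper, not part of pdfminer: the name iter_leaves is hypothetical, and it assumes that layout containers are iterable while character-level objects such as LTChar and LTAnno are not (which is what the error message above suggests for LTChar).

```python
from collections.abc import Iterable


def iter_leaves(obj):
    """Yield every non-iterable object nested under obj, depth first.

    Strings and bytes are treated as leaves so they are not split
    into single characters.
    """
    if isinstance(obj, Iterable) and not isinstance(obj, (str, bytes)):
        for child in obj:
            yield from iter_leaves(child)
    else:
        yield obj


# Stand-in data: nested lists play the role of layout containers.
nested = [['a', 'b'], 'c', [['d']]]
print(list(iter_leaves(nested)))
```

Applied to a page layout, the leaves would also include non-text objects such as rectangles and curves, so the caller would still need to keep only LTChar and LTAnno instances.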
You can also collect the extracted coordinates in a list, one list per page. I would do:
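    from pdfminer.layout import LAParams, LTTextBox, LTTextLineHorizontal, LTChar, LTAnno
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.converter import PDFPageAggregator

    all_coordinates = []  # one list of (x, y) word positions per page
    fp = open('Input.pdf', 'rb')
    manager = PDFResourceManager()
    laparams = LAParams()
    dev = PDFPageAggregator(manager, laparams=laparams)
    interpreter = PDFPageInterpreter(manager, dev)
    pages = PDFPage.get_pages(fp)

    for page in pages:
        page_coordinates = []
        interpreter.process_page(page)
        layout = dev.get_result()
        x, y, text = -1, -1, ''
        for textbox in layout:
            # Only descend into real text boxes
            if isinstance(textbox, LTTextBox):
                for line in textbox:
                    # Only iterate over horizontal text lines, never over a bare LTChar
                    if isinstance(line, LTTextLineHorizontal):
                        for char in line:
                            # A line-break or an empty space completes the word
                            if isinstance(char, LTAnno) or char.get_text() == ' ':
                                if x != -1:
                                    print('%r : %s' % ((x, y), text))
                                x, y, text = -1, -1, ''
                            elif isinstance(char, LTChar):
                                text += char.get_text()
                                if x == -1:
                                    # First character of a new word: record its position
                                    x, y, = char.bbox[0], char.bbox[3]
                                    page_coordinates.append((x, y))
        # Print the last word of the page if it was not followed by a space or an LTAnno
        if x != -1:
            print('At %r : %s' % ((x, y), text))
        all_coordinates.append(page_coordinates)
    fp.close()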
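If the full bounding box of each word is needed rather than only its starting corner, the word-splitting step can be separated from the pdfminer traversal. The following is a minimal sketch under stated assumptions: group_words is a hypothetical helper, not a pdfminer function, and it expects plain (text, bbox) pairs in which bbox is an (x0, y0, x1, y1) tuple for real characters and None for layout-inserted separators (the role LTAnno plays above).

```python
def group_words(chars):
    """Group (text, bbox) pairs into (word, bbox) tuples.

    A pair whose bbox is None, or whose text is whitespace, ends the
    current word. The bbox of a word is the union of the bboxes of
    its characters.
    """
    words = []
    text, box = '', None
    for ch, bbox in chars:
        if bbox is None or ch.isspace():
            if text:
                words.append((text, tuple(box)))
            text, box = '', None
        else:
            text += ch
            if box is None:
                box = list(bbox)
            else:
                # Grow the word box to cover the new character
                box[0] = min(box[0], bbox[0])
                box[1] = min(box[1], bbox[1])
                box[2] = max(box[2], bbox[2])
                box[3] = max(box[3], bbox[3])
    # Flush a trailing word that was not followed by a separator
    if text:
        words.append((text, tuple(box)))
    return words


# Hand-made sample standing in for the characters of one text line.
sample = [
    ('H', (0, 0, 5, 10)), ('i', (5, 0, 8, 10)), (' ', (8, 0, 10, 10)),
    ('y', (10, -2, 15, 8)), ('o', (15, 0, 20, 8)), ('\n', None),
]
print(group_words(sample))
# [('Hi', (0, 0, 8, 10)), ('yo', (10, -2, 20, 8))]
```

With pdfminer, the pairs could be built inside the line loop as (char.get_text(), char.bbox) for LTChar objects and (char.get_text(), None) for LTAnno objects, since LTAnno carries no coordinates.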
Upvotes: 1