merlin

Reputation: 61

Extract the coordinates of each word from PDF file using pdfminer

I am trying to extract the coordinates of each word from an input PDF file using pdfminer. This is the code I have tried:

from pdfminer.layout import LAParams, LTTextBox, LTText, LTChar, LTAnno
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.converter import PDFPageAggregator

fp = open('Input.pdf', 'rb')
manager = PDFResourceManager()
laparams = LAParams()
dev = PDFPageAggregator(manager, laparams=laparams)
interpreter = PDFPageInterpreter(manager, dev)
pages = PDFPage.get_pages(fp)
for page in pages:
    interpreter.process_page(page)
    layout = dev.get_result()
    x, y, text = -1, -1, ''
    for textbox in layout:
        if isinstance(textbox, LTText):
          for line in textbox:
            for char in line:
              # If the char is a line-break or an empty space, the word is complete
              if isinstance(char, LTAnno) or char.get_text() == ' ':
                if x != -1:
                    print('%r : %s' % ((x, y), text))
                x, y, text = -1, -1, ''
              elif isinstance(char, LTChar):
                text += char.get_text()
                if x == -1:
                  x, y, = char.bbox[0], char.bbox[3]
    # If the last symbol in the PDF was neither an empty space nor a LTAnno, print the word here
    if x != -1:
      print('At %r : %s' % ((x, y), text))

This extracts the word coordinates from the first page of the input file, but then the code fails with the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-154-a00e7d332dc4> in <module>
     19         if isinstance(textbox, LTText):
     20           for line in textbox:
---> 21             for char in line:
     22               # If the char is a line-break or an empty space, the word is complete
     23               if isinstance(char, LTAnno) or char.get_text() == ' ':

TypeError: 'LTChar' object is not iterable

My questions are:

  1. Why is the error occurring?
  2. My input PDF has 24 pages. How can I extract the coordinates of words from all of them?

Upvotes: 2

Views: 727

Answers (1)

junsuzuki

Reputation: 140

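  1. As Zach Young commented, the error occurs because `line` on line 21 is sometimes an LTChar, which cannot be iterated. LTText is a common base class of LTTextBox, LTTextLine, LTChar and LTAnno, so `isinstance(textbox, LTText)` also lets through objects that are not text boxes; when such an object is a text line, its children are characters rather than lines. I would check that `line` really is a text line before iterating over it (LTTextLineHorizontal must be imported from pdfminer.layout):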

    if isinstance(line, LTTextLineHorizontal):
    
  2. The page loop already visits every page; it only stops early because of the error. To keep the results, append the coordinates extracted from each page to a list. I would do:

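    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTAnno, LTChar, LTTextBox, LTTextLineHorizontal
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    all_coordinates = []  # one list of (x, y) word positions per page

    manager = PDFResourceManager()
    laparams = LAParams()
    dev = PDFPageAggregator(manager, laparams=laparams)
    interpreter = PDFPageInterpreter(manager, dev)

    with open('Input.pdf', 'rb') as fp:
        for page in PDFPage.get_pages(fp):
            page_coordinates = []

            interpreter.process_page(page)
            layout = dev.get_result()
            x, y, text = -1, -1, ''
            for textbox in layout:
                # Only descend into text boxes; skip figures, stray lines, etc.
                if isinstance(textbox, LTTextBox):
                    for line in textbox:
                        if isinstance(line, LTTextLineHorizontal):
                            for char in line:
                                # A line break (LTAnno) or a space ends the current word
                                if isinstance(char, LTAnno) or char.get_text() == ' ':
                                    if x != -1:
                                        print('%r : %s' % ((x, y), text))
                                    x, y, text = -1, -1, ''
                                elif isinstance(char, LTChar):
                                    text += char.get_text()
                                    if x == -1:
                                        # Left and top edge of the word's first character
                                        x, y = char.bbox[0], char.bbox[3]
                                        page_coordinates.append((x, y))
            # Flush the last word if the page did not end with a space or an LTAnno
            if x != -1:
                print('At %r : %s' % ((x, y), text))

            all_coordinates.append(page_coordinates)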
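
The word-splitting step can also be separated from pdfminer, which makes it easier to check on its own and lets you keep each word together with its coordinates. The following is only a sketch under some assumptions: `group_words` and the `Glyph` tuple are hypothetical names, not part of pdfminer, a `None` bounding box stands in for an LTAnno, and any whitespace character (not just `' '`) ends a word.

```python
from typing import Iterable, List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), same order as pdfminer's bbox
Glyph = Tuple[str, Optional[BBox]]        # hypothetical: bbox is None for an LTAnno-like item


def group_words(glyphs: Iterable[Glyph]) -> List[Tuple[str, float, float]]:
    """Group glyphs into words; return (word, x0, y1) taken from each word's first glyph."""
    words = []
    text, origin = '', None
    for char, bbox in glyphs:
        # A glyph without a bbox (line break) or a whitespace glyph ends the word.
        if bbox is None or char.isspace():
            if text:
                words.append((text, origin[0], origin[1]))
            text, origin = '', None
        else:
            if origin is None:
                origin = (bbox[0], bbox[3])  # left edge, top edge
            text += char
    # Flush a trailing word that was not followed by a separator.
    if text:
        words.append((text, origin[0], origin[1]))
    return words


# Made-up sample data, not taken from a real PDF.
glyphs = [
    ('H', (10.0, 700.0, 16.0, 712.0)),
    ('i', (16.0, 700.0, 19.0, 712.0)),
    (' ', (19.0, 700.0, 22.0, 712.0)),
    ('y', (22.0, 700.0, 28.0, 712.0)),
    ('o', (28.0, 700.0, 34.0, 712.0)),
    ('\n', None),
]
print(group_words(glyphs))  # [('Hi', 10.0, 712.0), ('yo', 22.0, 712.0)]
```

With pdfminer, one way to feed it would be to pass, for each item `c` in a text line, the pair `(c.get_text(), c.bbox if isinstance(c, LTChar) else None)`, calling the function once per line so that words never span two lines.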
    

Upvotes: 1
