PDFMiner Extraction for Single Words - LTText LTTextBox

Question

I am generating word x,y coordinates with PDFMiner in the below example, however the results are on a line by line basis. How can I split each word from another word, rather splitting groups of words line by line (see example below). I have tried several of the arguments in the PDFMiner tutorial. LTTextBox and LTText were both tried. Moreover, I cannot use beginning and end offsets normally used in text analytics.

This PDF is a good example, this is used in the code below.

http://www.africau.edu/images/default/sample.pdf

from pdfminer.layout import LAParams, LTTextBox, LTText
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.converter import PDFPageAggregator

#Imports Searchable PDFs and prints x,y coordinates
fp = open('C:\sample.pdf', 'rb')
manager = PDFResourceManager()
laparams = LAParams()
dev = PDFPageAggregator(manager, laparams=laparams)
interpreter = PDFPageInterpreter(manager, dev)
pages = PDFPage.get_pages(fp)

for page in pages:
    print('--- Processing ---')
    interpreter.process_page(page)
    layout = dev.get_result()
    for lobj in layout:
        if isinstance(lobj, LTText):
            x, y, text = lobj.bbox[0], lobj.bbox[3], lobj.get_text()
            print('At %r is text: %s' % ((x, y), text))

This returns the x,y coordinates for the searchable PDF as demonstrated below:

--- Processing ---
At (57.375, 747.903) is text: A Simple PDF File
At (69.25, 698.098) is text: This is a small demonstration .pdf file -
At (69.25, 674.194) is text: just for use in the Virtual Mechanics tutorials. More text. And more 
 text. And more text. And more text. And more text.

Wanted result (the coordinates are proxy for demonstration):

--- Processing ---
At (57.375, 747.903) is text: A
At (69.25, 698.098) is text: Simple
At (69.25, 674.194) is text: PDF
At (69.25, 638.338) is text: File

miwin · Accepted Answer

With PDFMiner, after going through each line (as you already did), you may only go through each character in the line.

I did this with the code below, while trying to record the x, y of the first character per word and setting up a condition to split the words at each LTAnno (e.g. ) or .get_text() == ' ' empty space.

from pdfminer.layout import LAParams, LTTextBox, LTText, LTChar, LTAnno
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.converter import PDFPageAggregator

#Imports Searchable PDFs and prints x,y coordinates
fp = open('C:\sample.pdf', 'rb')
manager = PDFResourceManager()
laparams = LAParams()
dev = PDFPageAggregator(manager, laparams=laparams)
interpreter = PDFPageInterpreter(manager, dev)
pages = PDFPage.get_pages(fp)

for page in pages:
    print('--- Processing ---')
    interpreter.process_page(page)
    layout = dev.get_result()
    x, y, text = -1, -1, ''
    for textbox in layout:
        if isinstance(textbox, LTText):
          for line in textbox:
            for char in line:
              # If the char is a line-break or an empty space, the word is complete
              if isinstance(char, LTAnno) or char.get_text() == ' ':
                if x != -1:
                  print('At %r is text: %s' % ((x, y), text))
                x, y, text = -1, -1, ''     
              elif isinstance(char, LTChar):
                text += char.get_text()
                if x == -1:
                  x, y, = char.bbox[0], char.bbox[3]    
    # If the last symbol in the PDF was neither an empty space nor a LTAnno, print the word here
    if x != -1:
      print('At %r is text: %s' % ((x, y), text))

The output looks as follows

At (64.881, 747.903) is text: A
At (90.396, 747.903) is text: Simple
At (180.414, 747.903) is text: PDF
At (241.92, 747.903) is text: File

Perhaps you can optimize the conditions to detect the words for your requirements and liking. (e.g. cut punctuation marks .!? at the end of words)

PDFMiner Extraction for Single Words - LTText LTTextBox

Answers (2)

Related Questions