Algorithms to Extract Text From a PDF (re-flowing text layout from a jumble of words)

Question

PDFs are made up of many individual text objects, which contain an X and Y coordinate, and a string. Often, these objects are put in the order they appear in the document, so extracting document text is as simple as reading the text objects in the order they appear in the PDF stream.

However, many PDFs do not play so nice.

The PDF spec does not require that the text be ordered in any way within the PDF stream. It is not uncommon to see PDFs where the end of the document is at the start of the stream, the middle is at the end, and the start is in the middle. In the extreme case, the stream is a jumble of text boxes in no particular order.

What algorithms exist for determining the proper text flow of the objects in the PDF stream?

For simple documents, ordering the text isn't too hard: you sort the objects top to bottom and left to right, then extract the text starting from the top-left-most text box and working your way down. However, documents often have multiple columns, titles, subheadings, headers, footers, tabbed paragraphs, etc. Are there any solutions that are robust across many different situations?

For clarity, below is an example of a function prototype that I am trying to implement.

def sort_according_to_text_flow(objs, page_width, page_height):
    """Return the list of objects, ordered for natural reading.

    objs: a list of objects where each object is a dict containing:
        x, y           the top-left corner position of the text box
        width, height  the width and height of the text box
        text           a string of the text
    page_width, page_height: width and height of the page
    """
    raise NotImplementedError  # the algorithm this question is asking about
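
As a baseline, the simple ordering described earlier (top to bottom, then left to right) might be sketched as follows. This is only an illustration, not a robust solution: the function name sort_naive_reading_order and the line_tolerance parameter are made up for this example, and it assumes y grows downward from the top of the page (native PDF coordinates usually grow upward from the bottom-left, so y may need flipping first).

```python
def sort_naive_reading_order(objs, line_tolerance=0.5):
    """Order text boxes top to bottom, then left to right within a line.

    Assumption: y increases downward, so a smaller y is higher on the page.
    line_tolerance is a hypothetical knob: two boxes count as the same line
    when their tops differ by at most this fraction of the line's first
    box height.
    """
    lines = []  # each entry is a list of boxes that share one visual line
    for obj in sorted(objs, key=lambda o: (o["y"], o["x"])):
        if lines:
            first = lines[-1][0]  # topmost box of the current line
            if abs(obj["y"] - first["y"]) <= line_tolerance * first["height"]:
                lines[-1].append(obj)
                continue
        lines.append([obj])  # start a new line

    ordered = []
    for line in lines:
        # Within a line, read left to right.
        ordered.extend(sorted(line, key=lambda o: o["x"]))
    return ordered


if __name__ == "__main__":
    boxes = [
        {"x": 50, "y": 101, "width": 40, "height": 10, "text": "world"},
        {"x": 10, "y": 200, "width": 40, "height": 10, "text": "second"},
        {"x": 10, "y": 100, "width": 40, "height": 10, "text": "hello"},
        {"x": 60, "y": 199, "width": 40, "height": 10, "text": "line"},
    ]
    print(" ".join(b["text"] for b in sort_naive_reading_order(boxes)))
```

Because it groups boxes purely by vertical position, this sketch reads straight across the page, so it interleaves the columns of a multi-column layout. That is exactly the kind of case I am asking about.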

Let's assume for the moment that we're only dealing with left-to-right languages.
