speedplane
speedplane

Reputation: 16121

Algorithms to Extract Text From a PDF (re-flowing text layout from a jumble of words)

PDFs are made up of many individual text objects, which contain an X and Y coordinate, and a string. Often, these objects are put in the order they appear in the document, so extracting document text is as simple as reading the text objects in the order they appear in the PDF stream.

However, many PDFs do not play so nice.

The PDF spec does not require that the text be ordered in any way within the PDF stream. It is not uncommon to see PDFs where the end of the PDF is at the start of the stream, the middle is at the end, and the start is in the middle. In the extreme case, the stream is a jumble of text boxes in no order.

What algorithms exist for determining the proper text flow of the objects in the PDF stream?

For simple documents, ordering the text isn't too hard: You order the objects top to bottom and left to right, and then extract the text from the most top left text boxes, working your way down. However, documents often have multiple columns, titles, subheadings, headers, footers, tabbed paragraphs, etc. Are there any solutions that are robust in many different situations?

For clarity, below is an example of a function prototype that I am trying to implement.

def sort_according_to_text_flow(objs, page_width, page_height):
   # objs               A list of objects where each object is a dict containing:
   #     x, y           The top-left corner position of the text box
   #     width, height  The width and height of the text box
   #     text           A string of the text
   # page_width/height  Width/height of the page
   # returns the list of objects, ordered for natural reading

Lets assume for the moment, that we're only dealing with left-to-right languages.

Upvotes: 4

Views: 2806

Answers (1)

speedplane
speedplane

Reputation: 16121

This problem appears to be the subject of quite a bit of research, typically referred to as "top down document layout analysis." One research paper summarizes quite a few different algorithms:

Typical top-down approaches proceed by dividing a document image into smaller regions using the horizontal and vertical projection profiles. The X-Y Cut algorithm [13], starts dividing a document image into sections based on valleys in their projection profiles (see figure 7). The algorithm repeatedly partitions the document by alternately projecting the regions of the current segmentation on the horizontal and vertical axes. The splitting is stopped when a particular criterion that decides the atomicity of a region is met. Projection profile based techniques are extremely sensitive to the skew of the document. A small amount of skew could completely change the nature of the projection, resulting in no clear valleys. Hence extreme care has to be taken while scanning of documents, or a reliable skew correction algorithm has to be applied before the segmentation process.

Efficient and accurate methods have been devised for documents with white background and a manhattan (rectangular) layout of the regions, based on geometric approaches that cover the background. Examples include the shape-directed cover algorithm by Baird et al. [17] and the white streams based segmentation by Pavlidis and Zhou [20]. These are top-down approaches that divide a document image into smaller regions based on the structure of the background. The primary idea is to detect a set of maximal rectangles that do not contain any foreground pixels. These rectangles divide the document image into regions that may contain text, images or graphics. A region classification algorithm could be applied to them before further processing. The algorithms mentioned above are very complex to implement due to the representation of intermediate results and many special cases that need to be handled. The whitespace cover algorithm by Breuel [19] is an efficient variation of the same idea using geometric principles. The algorithm is simpler due to the lack of special cases and performs very well for this class of documents.

Upvotes: 4

Related Questions