Biraj Man Singh

Reputation: 1

In what order does PDFBox parse a PDF file (for example, when a page has more than one column)?

If a page of a PDF file has two columns, does PDFBox parse it column by column or line by line?

Upvotes: 0

Views: 197

Answers (1)

mkl

Reputation: 95928

When you say "parse a PDF file", I assume you mean applying text extraction to it. In the case of PDFBox that would usually be done by means of the PDFTextStripper class.

In that case the answer is "it depends".

  • By default the PDFTextStripper extracts text in the order of the text drawing instructions in the content streams. Quite often this corresponds to the logical order of the content, because PDF generators usually have their input arranged in that order and generate output accordingly; multiple columns, for example, will often be extracted column-wise.

    BUT there is no guarantee for that; the text drawing instructions may theoretically be in any order; for example first all 'a's on a page may be drawn, then all 'b's, ...

    Such chaos is very rare. But if a PDF contains both fixed and dynamic content, you will often see the fixed content first and the dynamic content afterwards in the extracted text. For example, first labels like "First Name", "Last Name", "Date of Birth", ... and then the values.

  • Alternatively you can use PDFTextStripper.setSortByPosition to set the SortByPosition property to true. In this case PDFBox ignores the order of drawing and attempts to extract the PDF text line-by-line.
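
To make the difference between the two modes concrete, here is a minimal sketch in plain Java. It does not call PDFBox at all: the Chunk class, the coordinates, and the labels "L1", "R1", ... are hypothetical stand-ins for text pieces on a two-column page, and the sort is a deliberately simplified model, not PDFBox's actual algorithm. It assumes the generator drew the left column first and that y grows downward from the top of the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class ExtractionOrderDemo {

    /** Hypothetical positioned piece of text; NOT a PDFBox type. */
    public static final class Chunk {
        final String text;
        final double x; // distance from the left page edge
        final double y; // distance from the top page edge (assumption of this sketch)

        public Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Models the default mode: keep the order of the drawing instructions. */
    public static String inDrawingOrder(List<Chunk> chunks) {
        return chunks.stream().map(c -> c.text).collect(Collectors.joining(" "));
    }

    /** Simplified model of position sorting: top to bottom, then left to right. */
    public static String sortedByPosition(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y)
                .thenComparingDouble(c -> c.x));
        return sorted.stream().map(c -> c.text).collect(Collectors.joining(" "));
    }

    /** A two-column page whose left column was drawn before the right one. */
    public static List<Chunk> twoColumnPage() {
        return List.of(
                new Chunk("L1", 50, 100),   // left column, first line
                new Chunk("L2", 50, 120),   // left column, second line
                new Chunk("R1", 300, 100),  // right column, first line
                new Chunk("R2", 300, 120)); // right column, second line
    }

    public static void main(String[] args) {
        List<Chunk> page = twoColumnPage();
        System.out.println(inDrawingOrder(page));   // L1 L2 R1 R2 (column-wise)
        System.out.println(sortedByPosition(page)); // L1 R1 L2 R2 (line by line)
    }
}
```

With PDFBox itself, the switch between the two behaviours is the setSortByPosition(true) call on a PDFTextStripper instance before calling getText on the document. The real stripper has to deal with details this sketch ignores, such as text on slightly different baselines, so its output on a given file may differ from what this model suggests.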

Upvotes: 2
