Reputation: 1
If there are 2 columns on a page of a pdf file, does pdfbox parse it column-wise or line by line?
Upvotes: 0
Views: 197
Reputation: 95928
When you say "parse a PDF file", I assume you mean apply text extraction to it. In the case of PDFBox, that would usually be done by means of the PDFTextStripper class.
In that case the answer is "it depends".
By default the PDFTextStripper extracts text in the order of the text drawing instructions in the content streams. Quite often this corresponds to the logical order of the content, because PDF generators usually have their input arranged in that order and generate output accordingly; e.g. multiple columns will often be extracted column-wise.
BUT there is no guarantee of that; the text drawing instructions may in theory be in any order. For example, all the 'a's on a page may be drawn first, then all the 'b's, ...
Such chaos is very rare. But if a PDF contains both fixed and dynamic content, you may well see the fixed content first and the dynamic content afterwards in the extracted text. For example, first labels like "First Name", "Last Name", "Date of Birth", ... and then the values.
Alternatively you can use PDFTextStripper.setSortByPosition to set the SortByPosition property to true. In this case PDFBox ignores the order of drawing and attempts to extract the PDF text line by line.
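To make the difference between the two modes concrete, here is a small plain-Java sketch. It does not use PDFBox and is not PDFBox's actual algorithm; the class ExtractionOrderDemo, the Chunk type and the sample coordinates are all made up for illustration. It only models a page as a list of positioned text chunks in drawing order, and assumes y grows downwards.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ExtractionOrderDemo {

    /** Hypothetical stand-in for a drawn piece of text; y grows downwards here. */
    public static final class Chunk {
        final String text;
        final int x;
        final int y;

        public Chunk(String text, int x, int y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins the chunks in list order, i.e. in the order they were "drawn". */
    public static String inDrawingOrder(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            if (sb.length() > 0) {
                sb.append(" | ");
            }
            sb.append(c.text);
        }
        return sb.toString();
    }

    /** Ignores drawing order: sorts top-to-bottom, then left-to-right. */
    public static String sortedByPosition(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingInt((Chunk c) -> c.y)
                              .thenComparingInt(c -> c.x));
        return inDrawingOrder(sorted);
    }

    /** A form whose fixed labels are drawn first, then the dynamic values. */
    public static List<Chunk> samplePage() {
        List<Chunk> page = new ArrayList<>();
        page.add(new Chunk("First Name:", 10, 10));
        page.add(new Chunk("Last Name:", 10, 20));
        page.add(new Chunk("Jane", 80, 10));
        page.add(new Chunk("Doe", 80, 20));
        return page;
    }

    public static void main(String[] args) {
        // prints: First Name: | Last Name: | Jane | Doe
        System.out.println(inDrawingOrder(samplePage()));
        // prints: First Name: | Jane | Last Name: | Doe
        System.out.println(sortedByPosition(samplePage()));
    }
}
```

The same sort applied to a two-column page would interleave the columns line by line, which is why the default drawing order often gives the more readable result for multi-column text.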
Upvotes: 2