Reputation: 157
I am reading the text from a page of a PDF document using iText. There are two exactly identical lines in the PDF, but the parsed output differs between them. What could cause the iText library to emit the text differently? The lengths of both lines (strings) are the same.
iText methods used:
String text = PdfTextExtractor.getTextFromPage(reader, 1);
When I inspect the 'text' variable, the output is as below. However, these three lines seem to be exactly identical in the PDF.
XXXXXX XXXXX XXXXX : XXXXX :
#*2 1
XXXXXX XXXXX XXXXX : #*3 XXXXX : 2
XXXXXX XXXXX XXXXX : XXXXX :
#15 1
EDIT: Extra question: When I use PDFBox, the parsed output is very different. Why is there a difference in the output text when using iText vs PDFBox?
Upvotes: 0
Views: 254
Reputation: 95918
While in the screenshot the rows look like they are at a constant level each, they actually are not. The 'XXX...:' and 'TOTAL :' parts are at y coordinates 469.45, 457.95, and 446.45 while the '#..', '1', and '2' parts are at y coordinates 468.65, 457.15, and 445.65.
To consider horizontal text to be on the same line, iText text extraction using the default text extraction strategy (LocationTextExtractionStrategy) requires the y coordinates to be the same after casting to int. (Actually this is somewhat simplified; for the whole picture look at LocationTextExtractionStrategy.TextChunkLocationDefaultImp.)
In the case at hand this holds only for the middle row: (int) 457.95 = 457 = (int) 457.15. Thus, default text extraction results in:
XXXXXX XXXXX XXXXX : TOTAL :
#*2 1
XXXXXX XXXXX XXXXX : #*3 TOTAL : 2
XXXXXX XXXXX XXXXX: TOTAL :
#15 1
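The effect of the int cast on the three rows can be checked with a few lines of plain Java. This is only a sketch of the simplified rule stated above, not iText's actual comparison code; the class and method names are made up for illustration, and the y coordinates are the ones quoted in this answer.

```java
// Simplified illustration only: the real comparison in iText's
// LocationTextExtractionStrategy involves more than this single cast.
class SameLineCheck {

    // Hypothetical helper: two y coordinates count as "same line"
    // if they truncate to the same int value.
    static boolean sameLineAfterIntCast(double y1, double y2) {
        return (int) y1 == (int) y2;
    }

    public static void main(String[] args) {
        // {y of the 'XXX...:' / 'TOTAL :' parts, y of the '#..' / number parts}
        double[][] rows = {
            {469.45, 468.65},
            {457.95, 457.15},
            {446.45, 445.65}
        };
        for (double[] row : rows) {
            System.out.println((int) row[0] + " vs " + (int) row[1]
                    + " -> same line: " + sameLineAfterIntCast(row[0], row[1]));
        }
    }
}
```

Only the middle row reports true: 469 differs from 468 and 446 differs from 445, while both middle values truncate to 457.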
In such situations you need a text extraction strategy which recognizes lines differently. If you e.g. use the HorizontalTextExtractionStrategy or HorizontalTextExtractionStrategy2 (depending on your iText version, the former one for up to iText 5.5.8, the latter one for newer iText 5.5.x versions) from this answer, you'll get:
XXXXXX XXXXX XXXXX : #*2 TOTAL : 1
XXXXXX XXXXX XXXXX : #*3 TOTAL : 2
XXXXXX XXXXX XXXXX: #15 TOTAL : 1
(Tested using TextExtraction.java test method testTest_pdf())
By the way, this does not mean that one should switch to HorizontalTextExtractionStrategy2 by default. This strategy has its disadvantages, too; in particular it looks at the whole page width (or at least the whole width of the page section if extracting by filter) to find lines. Thus, if your page e.g. has two columns of text next to each other and lines are at the same approximate height only per column, this strategy will likely return utter garbage.
The OP asked in a comment:
"Can you give me a brief explanation of what the HorizontalTextExtractionStrategy is doing?"
While scanning the page, this strategy merely collects the text chunks from the text drawing instructions with their bounding box coordinates.
When asked for the resulting text, in a first pass it projects all these bounding boxes onto the y axis of the page coordinate system.
In the second pass it interprets each connected component of the image of this projection as the range of y coordinates of a single line: It iterates over these connected components top-to-bottom; for each component it takes all chunks projected into it, sorts them by their x coordinate, adds spaces where appropriate, and merges them into a text line.
Finally it returns the concatenation of these lines (with line feeds in-between).
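The passes described above can be sketched in plain Java without any iText dependency. This is not the code of HorizontalTextExtractionStrategy; the Chunk type, the helper at(), and the bounding box extent of 2 units below and 8 units above the baseline are assumptions made for illustration, and "adds spaces where appropriate" is reduced to a single space between chunks.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class HorizontalLineSketch {

    // Hypothetical chunk: text, left edge, and vertical extent of its bounding box.
    static final class Chunk {
        final String text;
        final double x, yBottom, yTop;

        Chunk(String text, double x, double yBottom, double yTop) {
            this.text = text;
            this.x = x;
            this.yBottom = yBottom;
            this.yTop = yTop;
        }
    }

    // Assumption: the box reaches 2 units below and 8 units above the baseline.
    static Chunk at(String text, double x, double baseline) {
        return new Chunk(text, x, baseline - 2, baseline + 8);
    }

    // Groups chunks whose y ranges form one connected interval, top-to-bottom.
    static List<String> toLines(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // The PDF y axis points upwards, so larger y values come first.
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.yTop).reversed());

        List<String> lines = new ArrayList<>();
        List<Chunk> current = new ArrayList<>();
        double currentBottom = 0;
        for (Chunk c : sorted) {
            // A gap between this chunk and the current component starts a new line.
            if (!current.isEmpty() && c.yTop < currentBottom) {
                lines.add(joinByX(current));
                current = new ArrayList<>();
            }
            currentBottom = current.isEmpty() ? c.yBottom : Math.min(currentBottom, c.yBottom);
            current.add(c);
        }
        if (!current.isEmpty()) {
            lines.add(joinByX(current));
        }
        return lines;
    }

    // Sorts the chunks of one line left-to-right and joins them with single spaces.
    static String joinByX(List<Chunk> line) {
        line.sort(Comparator.comparingDouble((Chunk c) -> c.x));
        StringBuilder sb = new StringBuilder();
        for (Chunk c : line) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c.text);
        }
        return sb.toString();
    }
}
```

With baselines such as 469.45 and 468.65 the assumed boxes overlap, so both chunks end up in the same line, while the next row (457.95 and 457.15) is separated from them by a gap and forms its own line.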
LocationTextExtractionStrategy says "This renderer keeps track of the orientation and distance (both perpendicular and parallel) to the unit vector of the orientation. Text is ordered by orientation, then perpendicular, then parallel distance. Text with the same perpendicular distance, but different parallel distance is treated as being on the same line." That does not make a lot of sense to me.
Essentially it also is a two-pass strategy, in a first pass collecting all text chunks with coordinates and in a second one arranging them as lines. This strategy, though, takes the orientation of the baseline of the chunks into account and first sorts by the angle of the baseline.
Among the chunks with the same baseline angle it considers chunks to belong to the same text line if their (bounded) baselines are on the same (unbounded) line.
The chunks considered to belong to the same text line then are sorted in the direction of the writing orientation and spaces are inserted where appropriate.
The comparisons made by this strategy all are based on int values and so allow for a tiny bit of variance.
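The ordering described above can be sketched as follows. This is a simplified model written for this explanation, not iText's implementation: the Chunk type and its fields are hypothetical, the distances are measured from the coordinate origin, and space insertion is reduced to a single space between chunks.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class LocationSketch {

    // Hypothetical chunk described by the start and end point of its baseline.
    static final class Chunk {
        final String text;
        final int orientation;    // baseline angle, scaled and truncated to int
        final int perpendicular;  // distance of the baseline from the origin, truncated to int
        final double parallel;    // position of the start point along the baseline direction

        Chunk(String text, double x1, double y1, double x2, double y2) {
            this.text = text;
            double dx = x2 - x1, dy = y2 - y1;
            double len = Math.hypot(dx, dy);
            double ux = dx / len, uy = dy / len;      // unit vector of the baseline
            orientation = (int) (Math.atan2(uy, ux) * 1000);
            perpendicular = (int) (x1 * uy - y1 * ux); // z component of start x unit vector
            parallel = x1 * ux + y1 * uy;              // dot product of start and unit vector
        }

        // Same text line: same baseline angle and same (unbounded) baseline.
        boolean sameLine(Chunk other) {
            return orientation == other.orientation && perpendicular == other.perpendicular;
        }
    }

    // Order by orientation, then perpendicular, then parallel distance.
    static final Comparator<Chunk> ORDER =
            Comparator.comparingInt((Chunk c) -> c.orientation)
                    .thenComparingInt(c -> c.perpendicular)
                    .thenComparingDouble(c -> c.parallel);

    static List<String> toLines(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(ORDER);
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        Chunk previous = null;
        for (Chunk c : sorted) {
            if (previous != null && !previous.sameLine(c)) {
                lines.add(line.toString());
                line = new StringBuilder();
            }
            if (line.length() > 0) {
                line.append(' ');
            }
            line.append(c.text);
            previous = c;
        }
        if (previous != null) {
            lines.add(line.toString());
        }
        return lines;
    }
}
```

For horizontal text the perpendicular value reduces to the truncated negative y coordinate, so baselines at 469.45 and 468.65 yield -469 and -468 and are split into two lines, whereas 457.95 and 457.15 both yield -457 and stay together.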
Upvotes: 2