Reputation: 31
I am using the iText Java API to extract text from a PDF:
String text = PdfTextExtractor.getTextFromPage(reader,i);
Src PDF content:
1.2 SUBMITTALS
Generated Text:
SUBMITTALS
1.2
The extracted text is split into two separate lines, and the order of the text is also wrong.
Can someone please help me understand what I am doing wrong?
Source PDF file: https://www.dropbox.com/s/vc9it3c7856ejli/testPDF.pdf?dl=0
Text file generated by iText: https://www.dropbox.com/s/ps2l9yz5ufuup01/test.txt?dl=0
When I test the same file with other PDF tools such as PDFClown or OCROnline, the text is extracted as expected.
Please help
Thanks
Upvotes: 0
Views: 625
Reputation: 96064
iText with its standard text extraction strategy extracts "1.2 SUBMITTALS" as
SUBMITTALS
1.2
because the "1.2" actually is located (minutely) below the "SUBMITTALS":
q .75000 0 0 .75000 0 792 cm
1 1 1 rg 0 0 816 -1056 re f
q .32000 0 0 .32000 0 0 cm
q
...
q .20823 0 0 .20807 0 0 cm
BT /F2 220 Tf 0 g 2340 -6628 Td(SUBMITTALS) Tj ET Q
q .20823 0 0 .20807 0 0 cm
BT /F2 220 Tf 0 g 1440 -6634 Td(1.2) Tj ET Q
As you can see in this excerpt of the content drawing instructions from the PDF, the "1.2" is drawn at the scaled y coordinate -6634 while "SUBMITTALS" is drawn at -6628, i.e. "1.2" is drawn 6 scaled units below "SUBMITTALS".
This makes iText put it onto a separate following line.
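To get a feeling for how small that offset is, the scaled coordinates can be converted to page units by hand. The following is only a rough sketch: it assumes that the part of the content stream elided as "..." leaves the transformation matrix unchanged, and the class and method names (BaselineOffset, toPageY) are made up for this illustration and are not part of iText.

```java
public class BaselineOffset {
    // Factors copied from the cm operators in the excerpt above.
    static final double PAGE_SCALE = 0.75;      // q .75000 0 0 .75000 0 792 cm
    static final double PAGE_SHIFT_Y = 792.0;   // vertical translation of that same cm
    static final double INNER_SCALE = 0.32;     // q .32000 0 0 .32000 0 0 cm
    static final double TEXT_SCALE_Y = 0.20807; // q .20823 0 0 .20807 0 0 cm

    // Maps a y value from a Td operand to page units.
    // Assumption: no other cm operator is active (the "..." part is ignored).
    static double toPageY(double scaledY) {
        return PAGE_SHIFT_Y + PAGE_SCALE * INNER_SCALE * TEXT_SCALE_Y * scaledY;
    }

    public static void main(String[] args) {
        double submittals = toPageY(-6628); // baseline of "SUBMITTALS"
        double number = toPageY(-6634);     // baseline of "1.2"
        System.out.printf("SUBMITTALS baseline y: %.4f%n", submittals);
        System.out.printf("1.2 baseline y:        %.4f%n", number);
        System.out.printf("difference:            %.4f%n", submittals - number);
    }
}
```

Under these assumptions the two baselines end up at about 461.02 and 460.72, i.e. only about 0.3 page units apart. If the default strategy compares baseline positions at integer precision (an assumption about its internals that is worth checking against the iText version in use), these two values land on different integers, 461 and 460, which would be consistent with the two strings being treated as separate lines.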
You can use the HorizontalTextExtractionStrategy2 from this answer instead of the default extraction strategy (cf. the test testTestPDF in TextExtraction.java) and get this output:
1.2 SUBMITTALS
(For details on the use of that strategy, see the answer mentioned above. HorizontalTextExtractionStrategy2 is the updated strategy from the section "UPDATE: Changes in LocationTextExtractionStrategy" of that answer.)
Upvotes: 2