iText Java not parsing text properly from PDF

Question

I am using iText Java API to extract text from a PDF.

String text = PdfTextExtractor.getTextFromPage(reader, i);

Src PDF content:

1.2 SUBMITTALS

Generated Text:

SUBMITTALS
1.2

The extracted text is split into two separate lines, and the order of the text is also wrong.

Can someone please help me understand what I am doing wrong?

Src pdf file link - https://www.dropbox.com/s/vc9it3c7856ejli/testPDF.pdf?dl=0

Target text file generated from iText - https://www.dropbox.com/s/ps2l9yz5ufuup01/test.txt?dl=0

When I test with other PDF tools such as PDFClown or OCROnline, the text is extracted as expected.

Please help

Thanks

mkl · Accepted Answer

The cause

With its standard text extraction strategy, iText extracts

1.2 SUBMITTALS

as

SUBMITTALS
1.2

because the "1.2" actually is located (minutely) below the "SUBMITTALS":

q .75000 0 0 .75000 0 792 cm 
1 1 1 rg 0 0 816 -1056 re f 
q .32000 0 0 .32000 0 0 cm 
q 
...
q .20823 0 0 .20807 0 0 cm 
BT /F2 220 Tf 0 g 2340 -6628 Td(SUBMITTALS) Tj ET Q
q .20823 0 0 .20807 0 0 cm 
BT /F2 220 Tf 0 g 1440 -6634 Td(1.2) Tj ET Q

As you can see in this excerpt of the content drawing instructions from the PDF, the "1.2" is drawn at the scaled y coordinate -6634 while "SUBMITTALS" is drawn at -6628, i.e. "1.2" is drawn 6 scaled units below "SUBMITTALS".

This makes iText put it onto a separate following line.
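To get a feeling for how small that offset is on the page, the scaled units can be converted to default user space units (points). The following is only a rough sketch: it assumes that the three cm operators shown in the excerpt are all in effect when the text is drawn and that the elided "..." part adds no further transformation. The class and method names are made up for this illustration. The last two lines print the baseline positions truncated to integers; that a strategy would tell lines apart by such truncated values is an assumption about iText internals, not something visible in the excerpt.

```java
import java.util.Locale;

class BaselineOffset {
    /** Multiplies the vertical scale factors of nested cm operators. */
    static double combinedScale(double... factors) {
        double scale = 1.0;
        for (double f : factors) {
            scale *= f;
        }
        return scale;
    }

    /** Maps a scaled y coordinate to the page; 792 is the translation of the outermost cm. */
    static double pageY(double scaledY, double scale) {
        return 792 + scaledY * scale;
    }

    public static void main(String[] args) {
        // Vertical scale factors taken from the excerpt (assumption: nothing else applies).
        double scale = combinedScale(0.75, 0.32, 0.20807);

        double offset = (-6628 - (-6634)) * scale; // distance between the two baselines
        double fontSize = 220 * scale;             // effective font size
        System.out.printf(Locale.ROOT, "offset = %.3f pt, font size = %.2f pt%n", offset, fontSize);

        double ySubmittals = pageY(-6628, scale);
        double yNumber = pageY(-6634, scale);
        System.out.printf(Locale.ROOT, "y(SUBMITTALS) = %.2f, y(1.2) = %.2f%n", ySubmittals, yNumber);
        // Truncated to integers, the two baselines no longer compare as equal.
        System.out.println((int) ySubmittals + " vs " + (int) yNumber);
    }
}
```

Under these assumptions the offset is about 0.3 points at a font size of about 11 points, so it is invisible to the eye, but the two baselines (about 461.02 and 460.72) fall on different sides of an integer boundary.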

A solution

You can use the HorizontalTextExtractionStrategy2 from this answer instead of the default extraction strategy, cf. TextExtraction.java test testTestPDF, and get this output:

1.2 SUBMITTALS

(For details on the use of that strategy, see the answer mentioned above. HorizontalTextExtractionStrategy2 is the updated strategy from the section "UPDATE: Changes in LocationTextExtractionStrategy" of that answer.)
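To illustrate the general idea of treating text pieces with nearly identical baselines as one line, here is a simplified, self-contained sketch. It is not the code of HorizontalTextExtractionStrategy2 and does not use iText at all; the Chunk type, the toLines method, the tolerance parameter and the coordinates (approximate page positions, derived under the assumption that only the cm operators shown in the excerpt apply) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TolerantLineGrouping {
    /** A piece of text with the page coordinates of its baseline start (hypothetical type). */
    static final class Chunk {
        final String text;
        final double x;
        final double y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Groups chunks into lines; baselines within the tolerance count as the same line. */
    static List<String> toLines(List<Chunk> chunks, double tolerance) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // Larger y first, i.e. from the top of the page downwards.
        sorted.sort((a, b) -> Double.compare(b.y, a.y));

        List<List<Chunk>> lines = new ArrayList<>();
        for (Chunk c : sorted) {
            List<Chunk> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(last.get(0).y - c.y) <= tolerance) {
                last.add(c); // close enough to the current line's first baseline
            } else {
                lines.add(new ArrayList<>(Arrays.asList(c))); // start a new line
            }
        }

        List<String> result = new ArrayList<>();
        for (List<Chunk> line : lines) {
            // Within a line, order the chunks from left to right.
            line.sort((a, b) -> Double.compare(a.x, b.x));
            StringBuilder sb = new StringBuilder();
            for (Chunk c : line) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(c.text);
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = Arrays.asList(
                new Chunk("SUBMITTALS", 117.0, 461.02),
                new Chunk("1.2", 72.0, 460.72));
        System.out.println(toLines(chunks, 1.0)); // one line: "1.2 SUBMITTALS"
        System.out.println(toLines(chunks, 0.0)); // two lines: "SUBMITTALS", then "1.2"
    }
}
```

With a tolerance of one point, the two pieces end up on the same line in left-to-right order; with no tolerance at all, the sketch reproduces the split and reordered output from the question.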
