Reputation: 73
I'm using iText 5.5.8 for Java. Following the default, straightforward text extraction procedures, i.e.
PdfTextExtractor.getTextFromPage(reader, pageNumber)
I was surprised to find several mistakes in the output: specifically, all letter d's come out as o's.
So how does text extraction in iText really work? Is it some kind of OCR?
I took a look under the hood, trying to grasp how TextExtractionStrategy works, but I couldn't figure out much. SimpleTextExtractionStrategy, for example, seems to just determine the presence of lines and spaces, whereas it's TextRenderInfo that provides the text, by invoking some decode method on the font field of a GraphicsState, and that's as far as I could go without getting a major migraine.
So who's my man? Which class should I override, or which parameter should I tweak, to be able to tell iText "hey, you're reading all the d's wrong!"?
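For anyone who wants to poke at the same spot, here is a minimal probe, using nothing beyond the stock iText 5 parser API, that subclasses SimpleTextExtractionStrategy and prints every chunk as it is decoded. The d/o swap is already present in what renderText receives:

import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

// Probe: log each decoded text chunk before it is assembled into lines.
class LoggingExtractionStrategy extends SimpleTextExtractionStrategy
{
    @Override
    public void renderText(TextRenderInfo renderInfo)
    {
        // getText() is already Unicode, decoded through the font's mapping,
        // so a wrong mapping has already done its damage at this point.
        System.out.println("chunk: " + renderInfo.getText());
        super.renderText(renderInfo);
    }
}

Running a page through it with PdfTextExtractor.getTextFromPage(reader, i, new LoggingExtractionStrategy()) shows the chunks exactly as iText decodes them.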
edit:
A sample PDF can be found at http://www.fpozzi.com/stampastopper/download/ (the file name is 0116_LR.pdf). Sorry, I can't share a direct link. This is some basic code for text extraction:
import java.io.File;
import java.io.IOException;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class Import
{
    public static void importFromPdf(final File pdfFile) throws IOException
    {
        PdfReader reader = new PdfReader(pdfFile.getAbsolutePath());
        try
        {
            // Extract and print the text of every page
            for (int i = 1; i <= reader.getNumberOfPages(); i++)
            {
                System.out.println(PdfTextExtractor.getTextFromPage(reader, i));
                System.out.println("----------------------------------");
            }
        }
        finally
        {
            reader.close();
        }
    }

    public static void main(String[] args)
    {
        try
        {
            importFromPdf(new File("0116_LR.pdf"));
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}
edit after @blagae's and @mkl's answers:
Before starting to fiddle with iText I had tried text extraction with Apache PDFBox (a project similar to iText that I had just discovered), but it has the same issue.
Understanding how these programs treat text is way beyond my dedication, so I have written a simple method to extract text from the raw page content, that is, whatever stands between the BT and ET markers:
import java.io.File;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.itextpdf.text.io.RandomAccessSourceFactory;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.RandomAccessFileOrArray;
import com.itextpdf.text.pdf.parser.ContentByteUtils;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class Import
{
    // Matches the literal string operands of Tj/TJ, e.g. (pelli).
    // Naive: ignores escaped parentheses and hex strings.
    private final static Pattern actualWordPattern = Pattern.compile("\\((.*?)\\)");

    public static void importFromPdf(final File pdfFile) throws IOException
    {
        PdfReader reader = new PdfReader(pdfFile.getAbsolutePath());
        try
        {
            for (int i = 1; i <= 16; i++) // page count hard-coded for the sample file
            {
                byte[] contentBytes = ContentByteUtils.getContentBytesForPage(reader, i);
                RandomAccessFileOrArray raf = new RandomAccessFileOrArray(
                        new RandomAccessSourceFactory().createSource(contentBytes));
                StringBuilder extractedText = new StringBuilder();
                String line;
                boolean insideTextObject = false;
                while ((line = raf.readLine()) != null)
                {
                    // A page usually contains many BT ... ET text objects;
                    // collect the string operands from all of them.
                    if (line.equals("BT"))
                    {
                        insideTextObject = true;
                        continue;
                    }
                    if (line.equals("ET"))
                    {
                        insideTextObject = false;
                        continue;
                    }
                    if (!insideTextObject)
                        continue;
                    Matcher matcher = actualWordPattern.matcher(line);
                    boolean anyMatchFound = false;
                    while (matcher.find())
                    {
                        anyMatchFound = true;
                        extractedText.append(matcher.group(1));
                    }
                    if (anyMatchFound)
                        extractedText.append("\n");
                }
                System.out.println(extractedText);
                System.out.println("+++++++++++++++++++++++++++");
                String properlyExtractedText = PdfTextExtractor.getTextFromPage(reader, i);
                System.out.println(properlyExtractedText);
                System.out.println("---------------------------");
            }
        }
        finally
        {
            reader.close();
        }
    }

    public static void main(String[] args)
    {
        try
        {
            importFromPdf(new File("0116_LR.pdf"));
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}
It appears, at least in my case, that the characters are correct. However, the order of words, or even of single letters, is messy (super messy, in fact), so this approach is unusable too.
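If I understand the format correctly, the order of operations in the content stream is the paint order, not the reading order, so a naive BT/ET scrape inherits whatever order the producing application happened to emit. This seems to be exactly what iText's extraction strategies compensate for: unless I misread the source, the plain getTextFromPage(reader, i) already uses LocationTextExtractionStrategy, which re-sorts chunks by their position on the page before joining them. Passing it explicitly makes that visible:

import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;

// Equivalent to the plain getTextFromPage(reader, i) call above:
// chunks are sorted by page position, not content-stream order.
String sorted = PdfTextExtractor.getTextFromPage(reader, i,
        new LocationTextExtractionStrategy());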
What really surprises me is that all methods I have tried so far to retrieve text from PDFs, including copy/paste from Adobe Reader, screw something up.
I have come to the conclusion that the most reliable way to get some decent text extraction may also be the most unexpected: good OCR. I am now trying to: 1) transform the PDF into an image (PDFBox is great at doing that; do not even bother to try pdf-renderer), and 2) OCR that image. I will post my results in a few days.
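For step 1, this is a minimal sketch of what I have in mind, assuming the PDFBox 2.x rendering API:

import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public class PdfToImages
{
    public static void main(String[] args) throws Exception
    {
        PDDocument document = PDDocument.load(new File("0116_LR.pdf"));
        try
        {
            PDFRenderer renderer = new PDFRenderer(document);
            for (int i = 0; i < document.getNumberOfPages(); i++)
            {
                // 300 dpi is a reasonable resolution for OCR input
                BufferedImage image = renderer.renderImageWithDPI(i, 300);
                ImageIO.write(image, "png", new File("page-" + (i + 1) + ".png"));
            }
        }
        finally
        {
            document.close();
        }
    }
}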
Upvotes: 4
Views: 1475
Reputation: 2392
Your input document has been created in a strange (but 'legal') way. There is a Unicode mapping in the resources that maps arbitrary glyphs to Unicode code points. In particular, character code 0x64, d in ASCII, is mapped to Unicode code point 0x6F, which is o, in this font. This is not a problem per se - any PDF viewer can handle it - but it is strange, because all the other glyphs that are used are not "cross-mapped": e.g. character 0x63 is mapped to code point 0x63 (which is c), etc.
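You can verify this yourself by dumping the ToUnicode CMap of each font on a page. A quick-and-dirty sketch with iText 5's low-level objects (no error handling, and it assumes the page's font resources and ToUnicode streams are present); the cross-mapping shows up as a bfchar entry along the lines of <64> <006F>:

import com.itextpdf.text.pdf.PRStream;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;

// Print the ToUnicode CMap of every font used on page 1.
public class DumpToUnicode
{
    public static void main(String[] args) throws Exception
    {
        PdfReader reader = new PdfReader("0116_LR.pdf");
        PdfDictionary page = reader.getPageN(1);
        PdfDictionary fonts = page.getAsDict(PdfName.RESOURCES).getAsDict(PdfName.FONT);
        for (PdfName name : fonts.getKeys())
        {
            PdfDictionary font = fonts.getAsDict(name);
            PRStream toUnicode = (PRStream) PdfReader.getPdfObject(font.get(PdfName.TOUNICODE));
            if (toUnicode != null)
            {
                System.out.println("--- " + name + " ---");
                System.out.println(new String(PdfReader.getStreamBytes(toUnicode)));
            }
        }
        reader.close();
    }
}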
Now for the reason that Acrobat does the text extraction correctly (except for the space) while the others get it wrong. We'll have to delve into the PDF syntax for this:
[(p) -17.9 (e) -15.1 (l) 1.4 (l) 8.4 (i) -20 (m) 5.8 (i) 14 (st) -17.5 (e) 31.2 (,) -20.1 (a)] TJ
<</ActualText <FEFF00640064>>> BDC
5.102 0 Td
[(d) -14.2 (d)] TJ
EMC
That tells a PDF viewer to print p-e-l-l-i- -m-i-st-e- -a on the first line of code, and d-d after that on the fourth line. However, d maps to o, which is apparently only a problem for text extraction. Acrobat does do the text extraction correctly, because there is a content marker /ActualText which says that whatever we write between the BDC and EMC markers must be parsed as dd (0x64, 0x64).
So to answer your question: iText does this on the same level as a lot of well-respected viewers, which all ignore the /ActualText marker. Except for Acrobat, which does respect it and overrules the ToUnicode mapping.
And to really answer your question: iText is currently looking into parsing the /ActualText marker, but it will probably take a while before it gets into an official release.
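In the meantime, if you want to pull those /ActualText values out yourself, here is a rough sketch using iText 5's low-level content parser. It is deliberately simplistic: it only handles BDC operators whose properties are an inline dictionary (a named properties resource is skipped), and it reports the replacement text instead of splicing it into the extracted text:

import java.io.IOException;
import java.util.ArrayList;

import com.itextpdf.text.io.RandomAccessSourceFactory;
import com.itextpdf.text.pdf.PRTokeniser;
import com.itextpdf.text.pdf.PdfContentParser;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfLiteral;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfString;
import com.itextpdf.text.pdf.RandomAccessFileOrArray;
import com.itextpdf.text.pdf.parser.ContentByteUtils;

// Walk a page's content stream and report every /ActualText replacement.
public class ActualTextDump
{
    public static void dump(PdfReader reader, int page) throws IOException
    {
        byte[] content = ContentByteUtils.getContentBytesForPage(reader, page);
        PRTokeniser tokeniser = new PRTokeniser(new RandomAccessFileOrArray(
                new RandomAccessSourceFactory().createSource(content)));
        PdfContentParser parser = new PdfContentParser(tokeniser);
        ArrayList<PdfObject> operands = new ArrayList<PdfObject>();
        while (parser.parse(operands).size() > 0)
        {
            // The operator is always the last element returned by parse()
            PdfLiteral operator = (PdfLiteral) operands.get(operands.size() - 1);
            if ("BDC".equals(operator.toString())
                    && operands.size() >= 3
                    && operands.get(1) instanceof PdfDictionary)
            {
                PdfDictionary properties = (PdfDictionary) operands.get(1);
                PdfString actualText = properties.getAsString(PdfName.ACTUALTEXT);
                if (actualText != null)
                    // toUnicodeString() honours the FEFF (UTF-16BE) marker
                    System.out.println("ActualText: " + actualText.toUnicodeString());
            }
        }
    }
}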
Upvotes: 5
Reputation: 1143
This probably has to do with how the PDF was OCR'd in the first place, rather than with how iTextSharp is parsing the PDF's contents. Try copying/pasting the text from the PDF into Notepad, and see if the "ds -> os" transformation still occurs. If it does, you're going to have to do the following when parsing text from this particular PDF:
Upvotes: 0