Henry Chinaski

Reputation: 73

iText: how to tweak text extraction?

I'm using iText 5.5.8 for Java. Following the default, straightforward text extraction procedures, i.e.

PdfTextExtractor.getTextFromPage(reader, pageNumber)

I was surprised to find several mistakes in the output; specifically, all letter ds come out as os.

So how does text extraction in iText really work? Is it some kind of OCR?

I took a look under the hood, trying to grasp how TextExtractionStrategy works, but I couldn't figure out much. SimpleTextExtractionStrategy, for example, seems to just determine the presence of lines and spaces, whereas it's TextRenderInfo that provides the text, by invoking some decode method on a GraphicsState's font field; that's as far as I could go without getting a major migraine.
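For reference, the extraction call also accepts an explicit strategy, so here is a minimal sketch showing where a custom one would plug in (as far as I can tell, the one-argument call uses the location-based strategy by default):

import java.io.IOException;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;

public class StrategyDemo
{
    public static void main(String[] args) throws IOException
    {
        PdfReader reader = new PdfReader("0116_LR.pdf");
        try
        {
            // orders text fragments by their position on the page
            System.out.println(PdfTextExtractor.getTextFromPage(reader, 1,
                    new LocationTextExtractionStrategy()));
            // emits text fragments in content-stream order instead
            System.out.println(PdfTextExtractor.getTextFromPage(reader, 1,
                    new SimpleTextExtractionStrategy()));
        }
        finally
        {
            reader.close();
        }
    }
}

Both variants produce the same wrong ds for me, so the choice of stock strategy alone doesn't seem to be the knob I'm looking for.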

So who's my man? Which class should I override or which parameter should I tweak to be able to tell iText "hey, you're reading all ds wrong!"

edit:

A sample PDF can be found at http://www.fpozzi.com/stampastopper/download/ (the file name is 0116_LR.pdf). Sorry, I can't share a direct link. Here is some basic code for text extraction:

import java.io.File;
import java.io.IOException;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class Import
{

    public static void importFromPdf(final File pdfFile) throws IOException
    {
        PdfReader reader = new PdfReader(pdfFile.getAbsolutePath());

        try
        {
            // extract each page with the default extraction strategy
            for (int i = 1; i <= reader.getNumberOfPages(); i++)
            {
                System.out.println(PdfTextExtractor.getTextFromPage(reader, i));
                System.out.println("----------------------------------");
            }
        }
        finally
        {
            reader.close();
        }
    }

    public static void main(String[] args)
    {
        try
        {
            importFromPdf(new File("0116_LR.pdf"));
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}

edit after @blagae's and @mkl's answers

Before starting to fiddle with iText, I had tried text extraction with Apache PDFBox (a project similar to iText that I just discovered), but it has the same issue.
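For completeness, the PDFBox attempt was just the stock text stripper, something like this (assuming PDFBox 2.x):

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfBoxImport
{
    public static void main(String[] args) throws IOException
    {
        try (PDDocument document = PDDocument.load(new File("0116_LR.pdf")))
        {
            // the stock extractor; shows the same d -> o substitution
            System.out.println(new PDFTextStripper().getText(document));
        }
    }
}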

Understanding how these programs treat text is way beyond my dedication, so I have written a simple method to extract text from the raw page content, that is, whatever stands between the BT and ET markers.

import java.io.File;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.itextpdf.text.io.RandomAccessSourceFactory;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.RandomAccessFileOrArray;
import com.itextpdf.text.pdf.parser.ContentByteUtils;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class Import
{

    private final static Pattern actualWordPattern = Pattern.compile("\\((.*?)\\)");

    public static void importFromPdf(final File pdfFile) throws IOException
    {
        PdfReader reader = new PdfReader(pdfFile.getAbsolutePath());

        Matcher matcher;
        String line;
        boolean anyMatchFound;
        try
        {
            for (int i = 1; i <= reader.getNumberOfPages(); i++)
            {
                byte[] contentBytes = ContentByteUtils.getContentBytesForPage(reader, i);
                RandomAccessFileOrArray raf = new RandomAccessFileOrArray(new RandomAccessSourceFactory().createSource(contentBytes));

                // collect the literal strings from every BT ... ET text block on the page
                StringBuilder extractedText = new StringBuilder();
                while ((line = raf.readLine()) != null)
                {
                    if (!line.equals("BT"))
                        continue;
                    while ((line = raf.readLine()) != null && !line.equals("ET"))
                    {
                        anyMatchFound = false;
                        // grab every (...) string operand on the line; note this
                        // ignores escaped parentheses and hex <...> strings
                        matcher = actualWordPattern.matcher(line);
                        while (matcher.find())
                        {
                            anyMatchFound = true;
                            extractedText.append(matcher.group(1));
                        }
                        if (anyMatchFound)
                            extractedText.append('\n');
                    }
                }
                System.out.println(extractedText);
                System.out.println("+++++++++++++++++++++++++++");
                String properlyExtractedText = PdfTextExtractor.getTextFromPage(reader, i);
                System.out.println(properlyExtractedText);
                System.out.println("---------------------------");
            }
        }
        finally
        {
            reader.close();
        }
    }

    public static void main(String[] args)
    {
        try
        {
            importFromPdf(new File("0116_LR.pdf"));
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}

It appears, at least in my case, that the characters are correct. However, the order of words or even letters is messy (super messy, in fact), presumably because the content stream emits text in drawing order rather than reading order, so this approach is unusable as well.

What really surprises me is that all methods I have tried so far to retrieve text from PDFs, including copy/paste from Adobe Reader, screw something up.

I have come to the conclusion that the most reliable way to get some decent text extraction may also be the most unexpected: some good OCR. I am now trying to:

  1. transform the PDF into an image (PDFBox is great at doing that; do not even bother to try pdf-renderer), as sketched below
  2. OCR that image

I will post my results in a few days.
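For step 1, here is a minimal sketch of the rendering side, assuming PDFBox 2.x (where PDFRenderer replaced the older PDPage.convertToImage); the OCR step would then run over the saved images:

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

import javax.imageio.ImageIO;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public class PdfToImages
{
    public static void main(String[] args) throws IOException
    {
        try (PDDocument document = PDDocument.load(new File("0116_LR.pdf")))
        {
            PDFRenderer renderer = new PDFRenderer(document);
            for (int i = 0; i < document.getNumberOfPages(); i++)
            {
                // 300 dpi is a common choice for OCR input
                BufferedImage image = renderer.renderImageWithDPI(i, 300);
                ImageIO.write(image, "png", new File("page-" + (i + 1) + ".png"));
            }
        }
    }
}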

Upvotes: 4

Views: 1475

Answers (2)

blagae

Reputation: 2392

Your input document has been created in a strange (but 'legal') way. There is a ToUnicode mapping in the font resources that maps character codes to arbitrary Unicode code points. In particular, character code 0x64, d in ASCII, is mapped to Unicode code point 0x6F, which is o, in this font. This is not a problem per se - any PDF viewer can handle it - but it is strange, because all the other glyphs that are used are not "cross-mapped": character 0x63 is mapped to code point 0x63 (which is c), and so on.
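Roughly, the offending part of the ToUnicode CMap looks like this (an illustrative reconstruction, not copied verbatim from the file):

2 beginbfchar
<63> <0063> % 'c' maps to 'c', as expected
<64> <006F> % 'd' maps to 'o': the culprit
endbfchar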

[Image: Faulty Unicode entry]

Now for the reason that Acrobat does the text extraction correctly (except for the space), and the others go wrong. We'll have to delve into the PDF syntax for this:

[(p) -17.9 (e) -15.1 (l) 1.4 (l) 8.4 (i) -20 ( m) 5.8 (i) 14 (st) -17.5 (e) 31.2 (,) -20.1 ( a)] TJ
<</ActualText <FEFF00640064> >> BDC
5.102 0 Td
[(d) -14.2 (d)] TJ
EMC

That tells a PDF viewer to print p-e-l-l-i- -m-i-st-e- -a on the first line of code, and d-d after that on the fourth line. However, d maps to o, which is apparently only a problem for text extraction. Acrobat does do the text extraction correctly, because there is a content marker /ActualText which says that whatever we write between the BDC and EMC markers must be parsed as dd (0x64,0x64).

So to answer your question: iText does this on the same level as a lot of well-respected viewers, which all ignore the /ActualText marker. Except for Acrobat, which does respect it and overrules the ToUnicode mapping.

And to really answer your question: iText is currently looking into parsing the /ActualText marker, but it will probably take a while before it gets into an official release.

Upvotes: 5

Brian Snow

Reputation: 1143

This probably has to do with how the PDF was OCR'd in the first place, rather than with how iTextSharp is parsing the PDF's contents. Try copying and pasting the text from the PDF into Notepad, and see if the "ds -> os" transformation still occurs. If it does, you're going to have to do the following when parsing text from this particular PDF (a sketch follows the list):

  1. Identify all occurrences of the string "os".
  2. Decide whether or not the word of which the given "os" instance is a constituent is a valid English/German/Spanish word.
  3. If it IS a valid word, do nothing.
  4. If it is NOT a valid word, perform the reverse "os -> ds" transformation, and check against the dictionary in the language of your choice again.
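A minimal sketch of that check in Java (the word list here is a hypothetical stand-in; a real dictionary for the target language would be loaded instead):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class OsToDsFixer
{
    // Hypothetical stand-in; load a real word list for the target language.
    private static final Set<String> DICTIONARY =
            new HashSet<>(Arrays.asList("winds", "holds", "words"));

    public static String fixWord(String word)
    {
        // steps 1 and 3: only touch words containing "os" that are not already valid
        if (!word.contains("os") || DICTIONARY.contains(word))
            return word;
        // steps 2 and 4: accept the reverse "os" -> "ds" form only if it is valid
        String swapped = word.replace("os", "ds");
        return DICTIONARY.contains(swapped) ? swapped : word;
    }
}

So, for example, fixWord("winos") would return "winds", while a legitimate word like "most" would pass through untouched.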

Upvotes: 0
