Alex75

Reputation: 517

iText PDF bad character conversion

I have a PDF to read that is driving me crazy.

The PDF is a customer's electricity bill (in Italian), and the customer wants me to extract the text from it.

Now the problem: when I copy and paste text from the PDF into Notepad, I get a bunch of incomprehensible characters...

After a lot of research I found the cause: the PDF embeds all its fonts, but it does not contain the CMap needed to export the text. I found this link, but it refers to an older version of iText (I'm using version 5.5.5).

What I want to achieve, if possible, is the conversion of the text from glyph codes to Unicode.

I've found some references to CMap-related classes, but I don't know how to use them, and apparently there are no examples on the net :(

This is what I've tried:

import java.io.FileOutputStream;
import java.io.PrintWriter;
import com.itextpdf.text.pdf.PdfEncodings;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.*;

PdfReader reader = new PdfReader("MyFile.pdf");
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
PrintWriter out = new PrintWriter(new FileOutputStream("MyFile.txt")); // output path assumed
TextExtractionStrategy strategy = parser.processContent(1, new SimpleTextExtractionStrategy());
String encodedText = strategy.getResultantText();
// try to re-interpret the extracted characters as UTF-16BE ("UnicodeBigUnmarked")
String cmapFile = "UnicodeBigUnmarked";
byte[] bytes = encodedText.getBytes();
String cid = PdfEncodings.convertToString(bytes, cmapFile);
out.println(cid);
out.close();

The resulting cid is a pretty Japanese-looking sequence of characters.

I also tried this, just before attempting the conversion:

FontFactory.registerDirectory("myDirectoryWithAllFonts");

This doesn't seem to give any results either.

Any help will be appreciated.

Upvotes: 2

Views: 1459

Answers (1)

Bruno Lowagie

Reputation: 77528

You say: When I copy and paste text from the PDF into Notepad, I get a bunch of incomprehensible characters. I assume that you are talking about selecting text in Adobe Reader and trying to paste it into a text editor.

If this doesn't succeed, you have a PDF that doesn't allow you to extract text, because the text isn't stored in the PDF correctly. Watch this video for the full explanation.

Let's take a look at your PDF from the inside:

[Screenshot: the page's content stream, showing the BT text object, the /C2_1 font selection, and the text arrays]

We see the start of a text object (where it says BT, which stands for Begin Text). A font /C2_1 is defined with font size 1. At first sight this may look odd, but the font will be scaled to size 6.9989 in a transformation. Then we see some text arrays containing strings of double-byte characters such as I R H E Z M W M S R I H I P.
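
If you want to see this for yourself, here is a minimal sketch (assuming iText 5.5.x, page 1, and the file name MyFile.pdf) that dumps the raw, decompressed content stream of the page so you can read the BT ... ET text object and the text arrays directly:

import java.io.FileOutputStream;
import com.itextpdf.text.pdf.PdfReader;

public class DumpContentStream {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("MyFile.pdf");
        // getPageContent returns the decompressed content stream of the page
        byte[] content = reader.getPageContent(1);
        FileOutputStream fos = new FileOutputStream("page1_content.txt");
        fos.write(content);
        fos.close();
        reader.close();
    }
}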

How should iText interpret these characters? To find out, we need to look at the encoding that is used for the font corresponding with /C2_1:

[Screenshot: the encoding information for the /C2_1 font]

Aha: according to this encoding, the character codes stored in the content stream map to exactly those Unicode characters: IRHE ZMWMSRI HIP and so on. That's exactly what we see when we convert the PDF to text using iText.
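
For reference, a minimal extraction sketch (iText 5, file name assumed) that reproduces exactly that output:

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class ExtractPageText {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("MyFile.pdf");
        // iText follows the font's declared encoding, so this prints the
        // shifted characters (IRHE ZMWMSRI HIP ...) instead of the visible text
        System.out.println(PdfTextExtractor.getTextFromPage(reader, 1));
        reader.close();
    }
}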

But wait a minute! How come we see other characters when we look at the PDF using Adobe Reader? Well, characters such as I, R, H and so on are addresses that correspond with the "program" of a glyph. This program is responsible for drawing the character on the page. One would expect that, in this case, the character I would correspond with the glyph (or "the drawing", if you prefer that word) of the letter I. No such luck in your PDF.
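
You can verify this from code by checking whether the fonts on the page carry a /ToUnicode CMap, which is what a text extractor needs to map character codes back to the intended Unicode values. A rough sketch (iText 5, page 1, class name hypothetical):

import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;

public class CheckToUnicode {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("MyFile.pdf");
        PdfDictionary resources = reader.getPageN(1).getAsDict(PdfName.RESOURCES);
        PdfDictionary fonts = resources.getAsDict(PdfName.FONT);
        for (PdfName name : fonts.getKeys()) {
            PdfDictionary font = fonts.getAsDict(name);
            // no /ToUnicode entry means there is no reliable mapping
            // from character codes back to Unicode for this font
            System.out.println(name + " has /ToUnicode: "
                    + (font.getAsStream(PdfName.TOUNICODE) != null));
        }
        reader.close();
    }
}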

Now what does Adobe do when you use "Copy with formatting"? Plenty of magic that currently isn't implemented in iText. Why not? Hmm... I don't know the budget of Adobe, but it's probably much, much higher than the budget of the iText Group. Extracting text from documents that contain confusing information about fonts isn't on the technical roadmap of the iText Group.

Upvotes: 2
