Reputation: 387
I am trying to read a Hebrew PDF, but I am getting gibberish instead.
I am using the code @mkl gave me a year ago when I had a similar problem, as described below, but unfortunately it is not working:
for (int i = 1; i <= pdfDocument.GetNumberOfPages(); i++)
{
    PdfPage page1 = pdfDocument.GetPage(i);

    // set the Encoding of every font used on the page to Identity-H
    PdfDictionary fontResources = page1.GetResources().GetResource(PdfName.Font);
    foreach (PdfObject font in fontResources.Values(true))
    {
        if (font is PdfDictionary fontDict)
            fontDict.Put(PdfName.Encoding, PdfName.IdentityH);
    }

    // get page size
    Rectangle pageSize = page1.GetMediaBox();
    float pageHeight = pageSize.GetHeight();
    float pageWidth = pageSize.GetWidth();

    // set location
    Rectangle rect = new Rectangle(0, 0, pageWidth, pageHeight);
    TextRegionEventFilter regionFilter = new TextRegionEventFilter(rect);
    ITextExtractionStrategy strategy = new FilteredTextEventListener(new LocationTextExtractionStrategy(), regionFilter);
    inputStr = PdfTextExtractor.GetTextFromPage(page1, strategy);

    // rest of code...
}
The output (inputStr) is total gibberish:
�����������\n��������\n������ �����������������\n���\n�����������\n����������������\n��������������������������� ������\n���\n���������\n��������������������������\n����������������������\n�������
Since the PDF has sensitive data, I can't really share it publicly...
Appreciate your help, Yaniv
Upvotes: 1
Views: 558
Reputation: 95918
The cause of this issue is the invalid ToUnicode CMaps of all the fonts in the PDF: these CMaps may be valid for other uses, but in the context of ToUnicode CMaps the PDF specification clearly restricts the data that may occur in this kind of CMap.
A small patch, though, enables iText to make sense of them.
The ToUnicode CMaps in the document are invalid in particular as they use begincidrange ... endcidrange sections for mapping the character codes instead of beginbfrange ... endbfrange and beginbfchar ... endbfchar sections as required here by the specification.
By chance iText does process ~cidrange sections in ToUnicode CMaps just like ~bfrange sections. (Well, not really by chance but because the class for processing ToUnicode CMaps extends an abstract base class for processing arbitrary CMaps.)
Unfortunately, though, ~cidrange ranges have an integer destination start value (e.g. <0003> <0003> 32), while ~bfrange ranges in ToUnicode CMaps must have hex string destination start values (e.g. <0000><005E><0020>). As a result, parsing these ToUnicode CMaps with ~cidrange ranges fails in iText with an exception, leaving behind a map without entries. Consequently, none of the character codes in the document can be mapped to anything sensible, and all text is extracted as replacement characters ('�').
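For illustration only (pieced together from the example values quoted above, not copied from the actual file), a mapping of code <0003> to U+0020 would look like this in the ~cidrange notation the document uses versus the ~bfchar notation a ToUnicode CMap is supposed to use:

1 begincidrange
<0003> <0003> 32
endcidrange

1 beginbfchar
<0003> <0020>
endbfchar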
An obvious approach to fix this is to enable the code that processes the ranges to also handle integer destination start values, not only hex string values.
With that fix applied most of the characters can be extracted properly. You merely have to deal with the wrong order of RTL script text.
(I tested only the Java version but the .Net version should work, too.)
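How to deal with that reversed RTL order depends on what you do with the extracted text. As a naive sketch only (not part of the fix, and ReverseLines is merely a hypothetical helper name), you could reverse each extracted line; that works for lines that are purely Hebrew, while lines mixing digits or Latin text need proper bidirectional reordering instead:

// Naive sketch: reverse each extracted line so purely right-to-left (Hebrew)
// lines read in logical order. Lines mixing digits or Latin text need real
// BiDi handling instead of a plain reverse.
static string ReverseLines(string extracted)
{
    var sb = new System.Text.StringBuilder();
    foreach (string line in extracted.Split('\n'))
    {
        char[] chars = line.ToCharArray();
        System.Array.Reverse(chars);
        sb.Append(chars).Append('\n');
    }
    return sb.ToString();
}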
In the com.itextpdf.io.font.cmap.CMapToUnicode class (com.itextpdf.io artifact) the following method is called to add mappings derived from ranges:
@Override
void addChar(String mark, CMapObject code) {
    if (mark.length() == 1) {
        char[] dest = createCharsFromDoubleBytes((byte[]) code.getValue());
        byteMappings.put((int) mark.charAt(0), dest);
    } else if (mark.length() == 2) {
        char[] dest = createCharsFromDoubleBytes((byte[]) code.getValue());
        byteMappings.put((mark.charAt(0) << 8) + mark.charAt(1), dest);
    } else {
        Logger logger = LoggerFactory.getLogger(CMapToUnicode.class);
        logger.warn(LogMessageConstant.TOUNICODE_CMAP_MORE_THAN_2_BYTES_NOT_SUPPORTED);
    }
}
Here code.getValue() is cast to byte[] without any type check, as that is how hex strings are represented here; integer values, in contrast, would be represented as Integer instances. Thus, the
char[] dest = createCharsFromDoubleBytes((byte[]) code.getValue());
lines have to be replaced by
char[] dest = (code.getValue() instanceof Integer) ? TextUtil.convertFromUtf32((Integer)code.getValue()) : createCharsFromDoubleBytes((byte[]) code.getValue());
lines.
In the iText.IO.Font.Cmap.CMapToUnicode class (itext.io.netstandard assembly) the following method is called to add mappings derived from ranges:
internal override void AddChar(String mark, CMapObject code) {
    if (mark.Length == 1) {
        char[] dest = CreateCharsFromDoubleBytes((byte[])code.GetValue());
        byteMappings.Put((int)mark[0], dest);
    }
    else {
        if (mark.Length == 2) {
            char[] dest = CreateCharsFromDoubleBytes((byte[])code.GetValue());
            byteMappings.Put((mark[0] << 8) + mark[1], dest);
        }
        else {
            ILog logger = LogManager.GetLogger(typeof(iText.IO.Font.Cmap.CMapToUnicode));
            logger.Warn(iText.IO.LogMessageConstant.TOUNICODE_CMAP_MORE_THAN_2_BYTES_NOT_SUPPORTED);
        }
    }
}
Here code.GetValue() is cast to byte[] without any type check, as that is how hex strings are represented here; integer values, in contrast, would be represented as int instances. As in the Java case, it should therefore suffice to replace the
char[] dest = CreateCharsFromDoubleBytes((byte[])code.GetValue());
lines by
char[] dest = (code.GetValue() is int) ? TextUtil.ConvertFromUtf32((int)code.GetValue()) : CreateCharsFromDoubleBytes((byte[])code.GetValue());
lines.
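Put together, the patched .Net method would look like this (merely a sketch of the method quoted above with that replacement applied; the rest of the class remains unchanged):

internal override void AddChar(String mark, CMapObject code) {
    if (mark.Length == 1) {
        // integer destinations come from ~cidrange style ranges,
        // hex string destinations from regular ~bfrange / ~bfchar sections
        char[] dest = (code.GetValue() is int)
            ? TextUtil.ConvertFromUtf32((int)code.GetValue())
            : CreateCharsFromDoubleBytes((byte[])code.GetValue());
        byteMappings.Put((int)mark[0], dest);
    }
    else {
        if (mark.Length == 2) {
            char[] dest = (code.GetValue() is int)
                ? TextUtil.ConvertFromUtf32((int)code.GetValue())
                : CreateCharsFromDoubleBytes((byte[])code.GetValue());
            byteMappings.Put((mark[0] << 8) + mark[1], dest);
        }
        else {
            ILog logger = LogManager.GetLogger(typeof(iText.IO.Font.Cmap.CMapToUnicode));
            logger.Warn(iText.IO.LogMessageConstant.TOUNICODE_CMAP_MORE_THAN_2_BYTES_NOT_SUPPORTED);
        }
    }
}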
Upvotes: 2