Reputation: 387
I am trying to read a Hebrew PDF, but I am getting gibberish instead.
I am using the code @mkl gave me a year ago when I had a similar problem, as described below, but unfortunately it is not working:
for (int i = 1; i <= pdfDocument.GetNumberOfPages(); i++)
{
    PdfPage page1 = pdfDocument.GetPage(i);

    // set the Encoding of every font used on the page to Identity-H
    PdfDictionary fontResources = page1.GetResources().GetResource(PdfName.Font);
    foreach (PdfObject font in fontResources.Values(true))
    {
        if (font is PdfDictionary fontDict)
            fontDict.Put(PdfName.Encoding, PdfName.IdentityH);
    }

    // get page size
    Rectangle pageSize = page1.GetMediaBox();
    float pageHeight = pageSize.GetHeight();
    float pageWidth = pageSize.GetWidth();

    // set location
    Rectangle rect = new Rectangle(0, 0, pageWidth, pageHeight);
    TextRegionEventFilter regionFilter = new TextRegionEventFilter(rect);
    ITextExtractionStrategy strategy = new FilteredTextEventListener(new LocationTextExtractionStrategy(), regionFilter);
    inputStr = PdfTextExtractor.GetTextFromPage(page1, strategy);

    // rest of code...
}
The output (inputStr) is total gibberish:
�����������\n��������\n������ �����������������\n���\n�����������\n����������������\n��������������������������� ������\n���\n���������\n��������������������������\n����������������������\n�������
Since the PDF has sensitive data, I can't really share it publicly...
Appreciate your help, Yaniv
Upvotes: 1
Views: 558
Reputation: 95918
The cause of this issue is the invalid ToUnicode CMaps of all the fonts in the PDF: these CMaps may be valid for other uses, but in the context of ToUnicode CMaps the PDF specification clearly restricts the data that may occur in this kind of CMap.
A small patch, though, enables iText to make sense of them.
The ToUnicode CMaps in the document are invalid in particular as they use begincidrange ... endcidrange sections for mapping the character codes instead of beginbfrange ... endbfrange and beginbfchar ... endbfchar sections as required here by the specification.
By chance iText does process ~cidrange sections in ToUnicode CMaps just like ~bfrange sections. (Well, not really by chance but because the class for processing ToUnicode CMaps extends an abstract base class for processing arbitrary CMaps.)
Unfortunately, though, ~cidrange ranges have an integer destination start value (e.g. <0003> <0003> 32), while ~bfrange ranges in ToUnicode CMaps must have hex string destination start values (e.g. <0000><005E><0020>). As a result, parsing these ToUnicode CMaps with ~cidrange ranges fails in iText with an exception, leaving behind a map without entries. Consequently, none of the character codes in the document can be mapped to anything sensible, and all text is extracted as replacement characters ('�').
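For illustration only (pieced together from the example values quoted above, not copied from the actual file), a mapping of code <0003> to U+0020 would look like this in the ~cidrange notation the document uses versus the ~bfchar notation a ToUnicode CMap is supposed to use:

1 begincidrange
<0003> <0003> 32
endcidrange

1 beginbfchar
<0003> <0020>
endbfchar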
An obvious approach to fix this is to enable the code that processes the ranges to also handle integer destination start values, not only hex string values.
With that fix applied most of the characters can be extracted properly. You merely have to deal with the wrong order of RTL script text.
(I tested only the Java version but the .Net version should work, too.)
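How to deal with that reversed RTL order depends on what you do with the extracted text. As a naive sketch only (not part of the fix, and ReverseLines is merely a hypothetical helper name), you could reverse each extracted line; that works for lines that are purely Hebrew, while lines mixing digits or Latin text need proper bidirectional reordering instead:

// Naive sketch: reverse each extracted line so purely right-to-left (Hebrew)
// lines read in logical order. Lines mixing digits or Latin text need real
// BiDi handling instead of a plain reverse.
static string ReverseLines(string extracted)
{
    var sb = new System.Text.StringBuilder();
    foreach (string line in extracted.Split('\n'))
    {
        char[] chars = line.ToCharArray();
        System.Array.Reverse(chars);
        sb.Append(chars).Append('\n');
    }
    return sb.ToString();
}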
In the com.itextpdf.io.font.cmap.CMapToUnicode class (com.itextpdf.io artifact) the following method is called to add mappings derived from ranges:
@Override
void addChar(String mark, CMapObject code) {
    if (mark.length() == 1) {
        char[] dest = createCharsFromDoubleBytes((byte[]) code.getValue());
        byteMappings.put((int) mark.charAt(0), dest);
    } else if (mark.length() == 2) {
        char[] dest = createCharsFromDoubleBytes((byte[]) code.getValue());
        byteMappings.put((mark.charAt(0) << 8) + mark.charAt(1), dest);
    } else {
        Logger logger = LoggerFactory.getLogger(CMapToUnicode.class);
        logger.warn(LogMessageConstant.TOUNICODE_CMAP_MORE_THAN_2_BYTES_NOT_SUPPORTED);
    }
}
Here code.getValue() is cast to byte[] without any type check, as that is how hex strings are represented here; integer values, in contrast, would be represented as Integer instances. Thus, the
char[] dest = createCharsFromDoubleBytes((byte[]) code.getValue());
lines have to be replaced by
char[] dest = (code.getValue() instanceof Integer) ? TextUtil.convertFromUtf32((Integer)code.getValue()) : createCharsFromDoubleBytes((byte[]) code.getValue());
lines.
In the iText.IO.Font.Cmap.CMapToUnicode class (itext.io.netstandard assembly) the following method is called to add mappings derived from ranges:
internal override void AddChar(String mark, CMapObject code) {
    if (mark.Length == 1) {
        char[] dest = CreateCharsFromDoubleBytes((byte[])code.GetValue());
        byteMappings.Put((int)mark[0], dest);
    }
    else {
        if (mark.Length == 2) {
            char[] dest = CreateCharsFromDoubleBytes((byte[])code.GetValue());
            byteMappings.Put((mark[0] << 8) + mark[1], dest);
        }
        else {
            ILog logger = LogManager.GetLogger(typeof(iText.IO.Font.Cmap.CMapToUnicode));
            logger.Warn(iText.IO.LogMessageConstant.TOUNICODE_CMAP_MORE_THAN_2_BYTES_NOT_SUPPORTED);
        }
    }
}
Here code.GetValue() is cast to byte[] without any type check, as that is how hex strings are represented here; integer values, in contrast, would be represented as int instances. As in the Java case, it should therefore suffice to replace the
char[] dest = CreateCharsFromDoubleBytes((byte[])code.GetValue());
lines by
char[] dest = (code.GetValue() is int) ? TextUtil.ConvertFromUtf32((int)code.GetValue()) : CreateCharsFromDoubleBytes((byte[])code.GetValue());
lines.
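Put together, the patched .Net method would look like this (merely a sketch of the method quoted above with that replacement applied; the rest of the class remains unchanged):

internal override void AddChar(String mark, CMapObject code) {
    if (mark.Length == 1) {
        // integer destinations come from ~cidrange style ranges,
        // hex string destinations from regular ~bfrange / ~bfchar sections
        char[] dest = (code.GetValue() is int)
            ? TextUtil.ConvertFromUtf32((int)code.GetValue())
            : CreateCharsFromDoubleBytes((byte[])code.GetValue());
        byteMappings.Put((int)mark[0], dest);
    }
    else {
        if (mark.Length == 2) {
            char[] dest = (code.GetValue() is int)
                ? TextUtil.ConvertFromUtf32((int)code.GetValue())
                : CreateCharsFromDoubleBytes((byte[])code.GetValue());
            byteMappings.Put((mark[0] << 8) + mark[1], dest);
        }
        else {
            ILog logger = LogManager.GetLogger(typeof(iText.IO.Font.Cmap.CMapToUnicode));
            logger.Warn(iText.IO.LogMessageConstant.TOUNICODE_CMAP_MORE_THAN_2_BYTES_NOT_SUPPORTED);
        }
    }
}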
Upvotes: 2