How to extract text from PDF file with IDENTITY-H fonts using VB.NET

Question

I have a PDF file.

I am reading text from the PDF file programmatically using the iTextSharp library. It reads text in ANSI-encoded fonts, but it does not read text in fonts that use IDENTITY-H encoding.

My problem is how to read IDENTITY-H text from a PDF file using VB.NET.

Below is my code:

Imports System.IO
Imports System.Text
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Public Function ReadPDFFile(ByVal strSource As String) As String

    Dim sbPDFText As New StringBuilder() 'StringBuilder object to store the text read

    If File.Exists(strSource) Then 'Does the file exist?
        Dim pdfFileReader As New PdfReader(strSource) 'Open the file
        Try
            For intCurrPage As Integer = 1 To pdfFileReader.NumberOfPages 'Loop through all pages

                'Custom strategy that reads the PDF content blocks
                Dim lteStrategy As New LocTextExtractionStrategy()
                'Get the text of the current page
                Dim strCurrText As String = PdfTextExtractor.GetTextFromPage(pdfFileReader, intCurrPage, lteStrategy)

                sbPDFText.Append(strCurrText) 'Add the text to the StringBuilder
            Next
        Finally
            pdfFileReader.Close() 'Close the file even if extraction fails
        End Try
    End If
    Return sbPDFText.ToString()

End Function

'Member of the custom LocTextExtractionStrategy class used above
Public Overridable Sub RenderText(ByVal renderInfo As TextRenderInfo) Implements ITextExtractionStrategy.RenderText

    Dim segment As LineSegment = renderInfo.GetBaseline()
    Dim location As New TextChunk(renderInfo.GetText(), segment.GetStartPoint(), segment.GetEndPoint(), renderInfo.GetSingleSpaceWidth())

    'An empty chunk: dump the text collected so far
    If renderInfo.GetText() = "" Then
        Console.WriteLine(GetResultantText())
    End If
    With location
        Debug.Print(renderInfo.GetText())
        'Chunk location:
        .PosLeft = renderInfo.GetDescentLine().GetStartPoint()(Vector.I1)
        .PosRight = renderInfo.GetAscentLine().GetEndPoint()(Vector.I1)
        .PosBottom = renderInfo.GetDescentLine().GetStartPoint()(Vector.I2)
        .PosTop = renderInfo.GetAscentLine().GetEndPoint()(Vector.I2)
        'Chunk font size (height):
        .curFontSize = .PosTop - segment.GetStartPoint()(Vector.I2)
        'Use the font name and size as the key in the SortedList
        Dim strKey As String = renderInfo.GetFont().PostscriptFontName & .curFontSize.ToString()
        'Add this font to the ThisPdfDocFonts SortedList if it's not already present
        If Not ThisPdfDocFonts.ContainsKey(strKey) Then ThisPdfDocFonts.Add(strKey, renderInfo.GetFont())
        'Store the SortedList index in this chunk, so we can get it later
        .FontIndex = ThisPdfDocFonts.IndexOfKey(strKey)
        Console.WriteLine(renderInfo.GetFont().ToString() & "-->" & strKey)
    End With
    locationalResult.Add(location)

End Sub

Bruno Lowagie · Accepted Answer

Thank you for sharing the PDF document. It helped us to determine that the problem you describe is not an iTextSharp problem. Instead it is a problem with the PDF document itself.

This problem doesn't have a solution, but I'm providing this answer to explain how you can discover for yourself that the problem also exists when iTextSharp isn't involved.

Open the document in Adobe Reader. Select the text "Muy señores nuestros" and copy/paste it into a text editor. You get "Muy señores nuestros". This is text that can be extracted using iTextSharp (it works correctly).

Now do the same with the text "GUARDIAN GLASS EXPRESS, S.L.". You get the following result: "􀀪􀀸􀀤􀀵􀀧􀀬􀀤􀀱􀀃􀀪􀀯􀀤􀀶􀀶􀀃􀀨􀀻􀀳􀀵􀀨􀀶􀀶􀀏􀀃􀀶􀀑􀀯􀀑". As you can see, you can not copy/paste the text correctly from Adobe Reader. This is due to the way the text is stored in the PDF. If you can not copy/paste the text from Adobe Reader, you should not expect to be able to extract the text using iTextSharp. The PDF is created in a way that doesn't allow extraction.
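The placeholder characters that Adobe Reader produced above look like Unicode private-use code points, which is a common sign that a glyph could not be mapped back to a real character. As a rough sketch (not part of iTextSharp; the module and function names are made up for illustration), a check like the following could flag extracted text that suffers from the same problem. It only uses the .NET base class library.

```vbnet
Imports System
Imports System.Globalization

Module PrivateUseCheck

    'Hypothetical helper: returns True when the text contains at least one
    'Unicode private-use code point (including the supplementary planes).
    Public Function ContainsPrivateUse(ByVal text As String) As Boolean
        Dim i As Integer = 0
        While i < text.Length
            'The (String, Integer) overload classifies a surrogate pair as one code point
            If CharUnicodeInfo.GetUnicodeCategory(text, i) = UnicodeCategory.PrivateUse Then
                Return True
            End If
            'Skip both halves of a surrogate pair
            i += If(Char.IsSurrogatePair(text, i), 2, 1)
        End While
        Return False
    End Function

    Sub Main()
        'Ordinary text: no private-use code points
        Console.WriteLine(ContainsPrivateUse("Muy señores nuestros"))
        'U+10002A is used here as an assumed example of a private-use placeholder
        Console.WriteLine(ContainsPrivateUse(Char.ConvertFromUtf32(&H10002A)))
    End Sub

End Module
```

If a check like this returns True for the text of a page, the text was most likely not extracted in a usable form, whichever tool produced it.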

Please take a look at this video to find out some possible causes: https://www.youtube.com/watch?v=wxGEEv7ibHE

I'm sorry that it took so long to figure this out and that it turns out that you're asking something that isn't possible. Your question narrowed the problem down too much, as if the problem was caused by the "IDENTITY-H" encoding and iTextSharp. In reality, you're trying to extract text that can't be extracted.

If you look at the page dictionary inside the PDF, you'll find three font resources for the first (and only) page:

[Screenshot: the page dictionary with its three font resources, and the page's content stream]

In the content stream (below the small red arrow), you see two strings in hexadecimal notation that are shown using the fonts referenced by the names C2_0 and C2_1. Incidentally, these fonts are stored as composite fonts with /Subtype /Type0 and /Encoding /Identity-H. This means that the characters used in the hexadecimal string should correspond with the Unicode values of the glyphs. If that's not the case, you're out of luck.

There seems to be no problem with the font for which the name /TT0 is used.

The fact that /TT0 uses WinAnsiEncoding and the other fonts use Identity-H is irrelevant. There are plenty of PDF files with fonts that use Identity-H whose text can be copy/pasted or extracted using iTextSharp. Unfortunately, there is probably something wrong with the way your PDF was constructed. It would take too much time to analyze what went wrong, so your best shot is to contact the person who gave you the PDF and ask them to fix it.
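To look at the font resources of a page without a PDF inspection tool, something like the sketch below could be used. It assumes iTextSharp 5.x; the module and method names are invented for this example. It prints each font's /Subtype and /Encoding and whether the font dictionary has a /ToUnicode entry. Checking /ToUnicode is an assumption on my part about one possible cause, since a missing or wrong /ToUnicode map is a frequent reason why Identity-H text cannot be extracted; the analysis above does not establish which defect applies to this particular file.

```vbnet
Imports System
Imports iTextSharp.text.pdf

Module FontInspector

    'Hypothetical helper: lists the fonts in the resources of one page.
    Public Sub ListPageFonts(ByVal path As String, ByVal pageNumber As Integer)
        Dim reader As New PdfReader(path)
        Try
            Dim page As PdfDictionary = reader.GetPageN(pageNumber)
            Dim resources As PdfDictionary = page.GetAsDict(PdfName.RESOURCES)
            If resources Is Nothing Then Return
            Dim fonts As PdfDictionary = resources.GetAsDict(PdfName.FONT)
            If fonts Is Nothing Then Return

            For Each fontName As PdfName In fonts.Keys
                Dim font As PdfDictionary = fonts.GetAsDict(fontName)
                If font Is Nothing Then Continue For
                'GetDirectObject resolves indirect references before printing
                Console.WriteLine("{0}: Subtype={1}, Encoding={2}, ToUnicode={3}",
                                  fontName,
                                  font.GetDirectObject(PdfName.SUBTYPE),
                                  font.GetDirectObject(PdfName.ENCODING),
                                  font.Contains(PdfName.TOUNICODE))
            Next
        Finally
            reader.Close() 'Close the file even if inspection fails
        End Try
    End Sub

End Module
```

Calling `ListPageFonts(strSource, 1)` on the document from the question should list entries for /C2_0, /C2_1 and /TT0. Note that a /ToUnicode entry being present does not guarantee that its contents are correct, so copy/paste from a viewer remains the more reliable test.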
