WNET

Reputation: 29

How to extract text from PDF file with IDENTITY-H fonts using VB.NET

I have a PDF file.

I am reading text from the PDF file programmatically using the iTextSharp library. It reads text that uses ANSI encoding, but it does not read text that uses Identity-H encoding.

My question is: how can I read Identity-H text from a PDF file using VB.NET?

Below is my code:

    'Imports needed at the top of the file
    Imports System.IO
    Imports System.Text
    Imports iTextSharp.text.pdf
    Imports iTextSharp.text.pdf.parser

    Public Function ReadPDFFile(ByVal strSource As String) As String
        Dim sbPDFText As New StringBuilder() 'Accumulates the text of all pages

        If File.Exists(strSource) Then 'Does the file exist?
            Dim pdfFileReader As New PdfReader(strSource) 'Open the file
            For intCurrPage As Integer = 1 To pdfFileReader.NumberOfPages 'Loop through all pages
                'Strategy that reads the PDF content blocks (a fresh instance per page)
                Dim lteStrategy As New LocTextExtractionStrategy()
                'Get the text of the current page
                Dim strCurrText As String = PdfTextExtractor.GetTextFromPage(pdfFileReader, intCurrPage, lteStrategy)
                sbPDFText.Append(strCurrText) 'Add the text to the StringBuilder
            Next
            pdfFileReader.Close() 'Close the file
        End If
        Return sbPDFText.ToString() 'Return the extracted text
    End Function

    Public Overridable Sub RenderText(ByVal renderInfo As TextRenderInfo) Implements ITextExtractionStrategy.RenderText
        Dim segment As LineSegment = renderInfo.GetBaseline()
        Dim location As New TextChunk(renderInfo.GetText(), segment.GetStartPoint(), segment.GetEndPoint(), renderInfo.GetSingleSpaceWidth())

        'Debug aid: print the text collected so far whenever a chunk comes back empty
        If renderInfo.GetText() = "" Then
            Console.WriteLine(GetResultantText())
        End If
        With location
            Debug.Print(renderInfo.GetText())
            'Chunk location:
            .PosLeft = renderInfo.GetDescentLine().GetStartPoint()(Vector.I1)
            .PosRight = renderInfo.GetAscentLine().GetEndPoint()(Vector.I1)
            .PosBottom = renderInfo.GetDescentLine().GetStartPoint()(Vector.I2)
            .PosTop = renderInfo.GetAscentLine().GetEndPoint()(Vector.I2)
            'Chunk font size (height):
            .curFontSize = .PosTop - segment.GetStartPoint()(Vector.I2)
            'Use font name and size as the key in the SortedList
            Dim StrKey As String = renderInfo.GetFont().PostscriptFontName & .curFontSize.ToString()
            'Add this font to the ThisPdfDocFonts SortedList if it's not already present
            '(the original "If 1 = 1 ... Else" wrapper had an unreachable Else branch and is dropped)
            If Not ThisPdfDocFonts.ContainsKey(StrKey) Then ThisPdfDocFonts.Add(StrKey, renderInfo.GetFont())
            'Store the SortedList index in this chunk, so we can get it later
            .FontIndex = ThisPdfDocFonts.IndexOfKey(StrKey)
            Console.WriteLine(renderInfo.GetFont().ToString() & "-->" & StrKey)
        End With
        locationalResult.Add(location)
    End Sub

Upvotes: 1

Views: 10406

Answers (1)

Bruno Lowagie

Reputation: 77538

Thank you for sharing the PDF document. It helped us to determine that the problem you describe is not an iTextSharp problem. Instead it is a problem with the PDF document itself.

This problem doesn't have a solution, but I'm providing this answer to explain how you can discover for yourself that the problem also exists when iTextSharp isn't involved.

Open the document in Adobe Reader. Select the text "Muy señores nuestros" and copy/paste it into a text editor. You get "Muy señores nuestros". This is text that can be extracted using iTextSharp (it works correctly).

Now do the same with the text "GUARDIAN GLASS EXPRESS, S.L.". You get the following result: "􀀪􀀸􀀤􀀵􀀧􀀬􀀤􀀱􀀃􀀪􀀯􀀤􀀶􀀶􀀃􀀨􀀻􀀳􀀵􀀨􀀶􀀶􀀏􀀃􀀶􀀑􀀯􀀑". As you can see, you can not copy/paste the text correctly from Adobe Reader. This is due to the way the text is stored in the PDF. If you can not copy/paste the text from Adobe Reader, you should not expect to be able to extract the text using iTextSharp. The PDF is created in a way that doesn't allow extraction.
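The copy/paste test above can be approximated in code. The following is a minimal sketch, assuming that the unreadable characters are code points from a Unicode private use area (as the pasted result appears to be); the module and function names are hypothetical and are not part of iTextSharp. It only detects the symptom, it does not recover the text.

    Imports System

    Module ExtractionCheck
        'Returns True when any code point of the text lies in a Unicode
        'private use area, which has no standard character meaning.
        Public Function ContainsPrivateUse(ByVal text As String) As Boolean
            Dim i As Integer = 0
            While i < text.Length
                If Char.IsSurrogate(text, i) AndAlso Not Char.IsSurrogatePair(text, i) Then
                    i += 1 'Skip a lone surrogate; ConvertToUtf32 would throw on it
                    Continue While
                End If
                Dim cp As Integer = Char.ConvertToUtf32(text, i)
                If (cp >= &HE000 AndAlso cp <= &HF8FF) OrElse _
                   (cp >= &HF0000 AndAlso cp <= &HFFFFD) OrElse _
                   (cp >= &H100000 AndAlso cp <= &H10FFFD) Then
                    Return True
                End If
                'Code points above U+FFFF occupy two UTF-16 chars
                i += If(Char.IsSurrogatePair(text, i), 2, 1)
            End While
            Return False
        End Function

        Sub Main()
            Console.WriteLine(ContainsPrivateUse("Muy señores nuestros")) 'False
            'U+10002A is a sample private use code point chosen for illustration
            Console.WriteLine(ContainsPrivateUse(Char.ConvertFromUtf32(&H10002A))) 'True
        End Sub
    End Module

Running such a check on the string returned by PdfTextExtractor.GetTextFromPage would flag pages whose text comes out as placeholders rather than real characters.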

Please take a look at this video to find out some possible causes: https://www.youtube.com/watch?v=wxGEEv7ibHE

I'm sorry that it took so long to figure this out and that it turns out that you're asking something that isn't possible. Your question narrowed the problem down too much, as if the problem was caused by the "IDENTITY-H" encoding and iTextSharp. In reality, you're trying to extract text that can't be extracted.

If you look at the page dictionary inside the PDF, you'll find three font resources for the first (and only) page:

[Screenshot: page dictionary and content stream of the PDF]

In the content stream (below the small red arrow), you see two strings in hexadecimal notation that are shown using the fonts referenced by the names C2_0 and C2_1. Incidentally, these fonts are stored as composite fonts with /Subtype /Type0 and /Encoding /Identity-H. With Identity-H, the two-byte codes in the hexadecimal string are used directly as glyph identifiers, so the text can only be extracted if those codes can also be mapped correctly to the Unicode values of the glyphs (for instance through a /ToUnicode entry in the font). If that's not the case, you're out of luck.
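To make the hexadecimal notation concrete: with Identity-H, every character code in such a string is two bytes long. Below is a minimal sketch that splits a hex string into its two-byte codes. The helper name and the sample value are hypothetical (the sample is not taken from the PDF in question), and the sketch assumes a string without embedded whitespace and with a digit count that is a multiple of four.

    Imports System
    Imports System.Collections.Generic
    Imports System.Globalization

    Module HexCodes
        'Splits a PDF hex string such as "<002A0038>" into 2-byte codes.
        Public Function SplitTwoByteCodes(ByVal hexString As String) As List(Of Integer)
            'Remove surrounding whitespace and the angle-bracket delimiters
            Dim digits As String = hexString.Trim().TrimStart("<"c).TrimEnd(">"c)
            Dim codes As New List(Of Integer)()
            'Every group of four hex digits is one two-byte code
            For i As Integer = 0 To digits.Length - 4 Step 4
                codes.Add(Integer.Parse(digits.Substring(i, 4), NumberStyles.HexNumber))
            Next
            Return codes
        End Function

        Sub Main()
            For Each code As Integer In SplitTwoByteCodes("<002A0038>")
                Console.WriteLine(code) 'Prints 42, then 56
            Next
        End Sub
    End Module

The resulting numbers identify glyphs in the font; whether they can be turned back into readable characters depends on the mapping information stored with the font.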

There seems to be no problem with the font for which the name /TT0 is used.

The fact that /TT0 uses WinAnsiEncoding and the other fonts use Identity-H is irrelevant. There are plenty of PDF files with fonts that use Identity-H whose text can be copied and pasted or extracted using iTextSharp. Unfortunately, there is probably something wrong with the way your PDF was constructed. It would take too much time to analyze what went wrong, so your best option is to contact the person who gave you the PDF and ask them to fix it.
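The font resources discussed above can also be listed programmatically instead of with a PDF inspection tool. The following is a rough sketch, assuming iTextSharp 5.x; the module and method names are hypothetical. It prints the subtype and encoding of each font on a page, and whether the font dictionary has a /ToUnicode entry. Note that the presence of /ToUnicode does not guarantee that the mapping it contains is correct.

    Imports System
    Imports iTextSharp.text.pdf

    Module FontInspector
        'Lists the font resources of one page with their subtype and encoding.
        Public Sub ListPageFonts(ByVal path As String, ByVal pageNumber As Integer)
            Dim reader As New PdfReader(path)
            Try
                Dim page As PdfDictionary = reader.GetPageN(pageNumber)
                Dim resources As PdfDictionary = page.GetAsDict(PdfName.RESOURCES)
                If resources Is Nothing Then Return
                Dim fonts As PdfDictionary = resources.GetAsDict(PdfName.FONT)
                If fonts Is Nothing Then Return
                For Each key As PdfName In fonts.Keys
                    Dim font As PdfDictionary = fonts.GetAsDict(key)
                    If font Is Nothing Then Continue For
                    'GetDirectObject resolves indirect references before printing
                    Console.WriteLine("{0}: Subtype={1}, Encoding={2}, ToUnicode={3}", _
                        key, _
                        font.GetDirectObject(PdfName.SUBTYPE), _
                        font.GetDirectObject(PdfName.ENCODING), _
                        font.Contains(PdfName.TOUNICODE))
                Next
            Finally
                reader.Close() 'Always release the file
            End Try
        End Sub
    End Module

Calling ListPageFonts with the path of the PDF and page number 1 should print one line per font resource, which makes it easy to compare a font such as /TT0 with the Identity-H fonts.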

Upvotes: 3
