Reputation: 29
I have a PDF file.
I am reading text from the PDF file programmatically using the iTextSharp library. It reads text that uses ANSI encoding, but it does not read text that uses Identity-H encoding.
My question is: how can I read the Identity-H text from the PDF file using VB.NET?
Below is my code:
Imports System.IO
Imports System.Text
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Public Function ReadPDFFile(ByVal strSource As String) As String
    Dim sbPDFText As New StringBuilder() 'StringBuilder object to store the extracted text
    If File.Exists(strSource) Then 'Does the file exist?
        Dim pdfFileReader As New PdfReader(strSource) 'Open the file
        For intCurrPage As Integer = 1 To pdfFileReader.NumberOfPages 'Loop through all pages
            'LocTextExtractionStrategy is the custom strategy whose RenderText is shown below
            Dim lteStrategy As LocTextExtractionStrategy = New LocTextExtractionStrategy
            'Get the text of the current page
            Dim strCurrText As String = PdfTextExtractor.GetTextFromPage(pdfFileReader, intCurrPage, lteStrategy)
            sbPDFText.Append(strCurrText) 'Add the text to the StringBuilder
        Next
        pdfFileReader.Close() 'Close the file
    End If
    Return sbPDFText.ToString() 'Return the collected text
End Function
Public Overridable Sub RenderText(ByVal renderInfo As TextRenderInfo) Implements ITextExtractionStrategy.RenderText
    Dim segment As LineSegment = renderInfo.GetBaseline()
    Dim location As New TextChunk(renderInfo.GetText(), segment.GetStartPoint(), segment.GetEndPoint(), renderInfo.GetSingleSpaceWidth())
    If renderInfo.GetText = "" Then
        Console.WriteLine(GetResultantText())
    End If
    With location
        'Chunk Location:
        Debug.Print(renderInfo.GetText)
        .PosLeft = renderInfo.GetDescentLine.GetStartPoint(Vector.I1)
        .PosRight = renderInfo.GetAscentLine.GetEndPoint(Vector.I1)
        .PosBottom = renderInfo.GetDescentLine.GetStartPoint(Vector.I2)
        .PosTop = renderInfo.GetAscentLine.GetEndPoint(Vector.I2)
        'Chunk Font Size: (Height)
        .curFontSize = .PosTop - segment.GetStartPoint()(Vector.I2)
        'Use Font name and Size as Key in the SortedList
        Dim StrKey As String = renderInfo.GetFont.PostscriptFontName & .curFontSize.ToString
        'Add this font to ThisPdfDocFonts SortedList if it's not already present
        If 1 = 1 Then
            If Not ThisPdfDocFonts.ContainsKey(StrKey) Then ThisPdfDocFonts.Add(StrKey, renderInfo.GetFont)
            'Store the SortedList index in this Chunk, so we can get it later
            .FontIndex = ThisPdfDocFonts.IndexOfKey(StrKey)
            Console.WriteLine(renderInfo.GetFont.ToString & "-->" & StrKey)
        Else
            'pcbContent.SetFontAndSize(BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED), 9)
            .FontIndex = 3
            .curFontSize = 8
        End If
    End With
    locationalResult.Add(location)
End Sub
Upvotes: 1
Views: 10406
Reputation: 77538
Thank you for sharing the PDF document. It helped us to determine that the problem you describe is not an iTextSharp problem. Instead it is a problem with the PDF document itself.
This problem doesn't have a solution, but I'm providing this answer to explain how you can discover for yourself that the problem also exists when iTextSharp isn't involved.
Open the document in Adobe Reader. Select the text "Muy señores nuestros" and copy/paste it into a text editor. You get "Muy señores nuestros". This is text that can be extracted using iTextSharp (it works correctly).
Now do the same with the text "GUARDIAN GLASS EXPRESS, S.L.". You get the following result: "". As you can see, you cannot copy/paste the text correctly from Adobe Reader. This is due to the way the text is stored in the PDF. If you cannot copy/paste the text from Adobe Reader, you should not expect to be able to extract it using iTextSharp. The PDF was created in a way that doesn't allow extraction.
Please take a look at this video to find out some possible causes: https://www.youtube.com/watch?v=wxGEEv7ibHE
I'm sorry that it took so long to figure this out and that it turns out that you're asking something that isn't possible. Your question narrowed the problem down too much, as if the problem was caused by the "IDENTITY-H" encoding and iTextSharp. In reality, you're trying to extract text that can't be extracted.
If you look at the page dictionary inside the PDF, you'll find three font resources for the first (and only) page: /TT0, /C2_0 and /C2_1.
In the content stream, you see two strings (in hexadecimal notation) that are shown using the fonts referenced by the names C2_0 and C2_1. Incidentally, these fonts are stored as composite fonts with /Subtype /Type0 and /Encoding /Identity-H. This means that the two-byte codes in the hexadecimal strings identify glyphs in the font rather than characters in a standard encoding, so they can only be turned back into text if they correspond to the Unicode values of the glyphs or if the font supplies a correct mapping to Unicode (typically a /ToUnicode CMap). If that's not the case, you're out of luck.
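As a rough illustration of the point about hexadecimal strings and Unicode values, here is a minimal, library-free sketch in VB.NET. The hex string, the glyph codes and the mapping table are all hypothetical; they are not taken from your PDF. The sketch splits a hex string into two-byte codes and decodes them twice: once through a mapping table (standing in for a correct code-to-Unicode mapping) and once by treating the codes as if they were Unicode values.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module IdentityHDemo
    ' Splits a PDF-style hex string such as "002A0038" into two-byte codes.
    Function ParseCodes(ByVal hex As String) As List(Of Integer)
        Dim codes As New List(Of Integer)()
        For i As Integer = 0 To hex.Length - 4 Step 4
            codes.Add(Convert.ToInt32(hex.Substring(i, 4), 16))
        Next
        Return codes
    End Function

    ' Decodes the codes with a mapping table when one is given;
    ' otherwise falls back to treating each code as a Unicode value.
    Function Decode(ByVal codes As List(Of Integer), ByVal toUnicode As Dictionary(Of Integer, String)) As String
        Dim sb As New StringBuilder()
        For Each code As Integer In codes
            Dim mapped As String = Nothing
            If toUnicode IsNot Nothing AndAlso toUnicode.TryGetValue(code, mapped) Then
                sb.Append(mapped)
            Else
                sb.Append(ChrW(code))
            End If
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        ' Hypothetical glyph codes: &H2A stands for "G" and &H38 for "U" in this made-up font.
        Dim codes As List(Of Integer) = ParseCodes("002A0038")
        Dim map As New Dictionary(Of Integer, String)()
        map.Add(&H2A, "G")
        map.Add(&H38, "U")

        Console.WriteLine(Decode(codes, map))     ' prints "GU"
        Console.WriteLine(Decode(codes, Nothing)) ' prints "*8"
    End Sub
End Module
```

Without a usable mapping, the same codes come out as unrelated characters, which is the kind of result you see when copying such text from a viewer.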
There seems to be no problem with the font for which the name /TT0 is used.
The fact that /TT0 uses WinAnsiEncoding and the other fonts use Identity-H is irrelevant. There are plenty of PDF files with fonts that use Identity-H whose text can be copied and pasted, or extracted using iTextSharp. Unfortunately, there is probably something wrong with the way your PDF was constructed. It would take too much time to analyze what went wrong, so your best option is to contact the person who gave you the PDF and ask them to fix it.
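If you want to see for yourself how each font on the page is declared, a sketch along the following lines may help. It assumes iTextSharp 5.x, a hypothetical file path, and that the font resources sit directly on the page dictionary (resources inherited from a parent node are not handled). It prints each font's /Subtype and /Encoding and whether a /ToUnicode entry is present. A missing or incorrect /ToUnicode map is one possible cause of text that cannot be extracted, not necessarily the cause in this document.

```vbnet
Imports System
Imports iTextSharp.text.pdf

Module FontInspector
    ' Prints how each font referenced by a page is declared in the PDF.
    Sub ListPageFonts(ByVal path As String, ByVal pageNumber As Integer)
        Dim reader As New PdfReader(path)
        Try
            Dim page As PdfDictionary = reader.GetPageN(pageNumber)
            Dim resources As PdfDictionary = page.GetAsDict(PdfName.RESOURCES)
            If resources Is Nothing Then Return
            Dim fonts As PdfDictionary = resources.GetAsDict(PdfName.FONT)
            If fonts Is Nothing Then Return

            For Each fontName As PdfName In fonts.Keys
                Dim font As PdfDictionary = fonts.GetAsDict(fontName)
                If font Is Nothing Then Continue For

                Dim subtype As PdfName = font.GetAsName(PdfName.SUBTYPE)
                ' /Encoding can be a name (e.g. /Identity-H) or a dictionary.
                Dim encoding As PdfObject = font.GetDirectObject(PdfName.ENCODING)
                Dim hasToUnicode As Boolean = font.Contains(PdfName.TOUNICODE)

                Console.WriteLine("{0}: Subtype={1}, Encoding={2}, ToUnicode={3}", _
                                  fontName, subtype, encoding, hasToUnicode)
            Next
        Finally
            reader.Close() ' Always release the file
        End Try
    End Sub

    Sub Main()
        ' Hypothetical path; replace with the location of your own PDF.
        ListPageFonts("C:\temp\sample.pdf", 1)
    End Sub
End Module
```

Even when a /ToUnicode entry is present, its contents can still be wrong, so the copy/paste test in Adobe Reader remains the quickest check.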
Upvotes: 3