iTextSharp can't read some PDF files

Question

I have a problem reading some PDFs and displaying their content in a RichTextBox. I use the following code:

string fileName = @"C:\Users\PC\Desktop\SomePdf.pdf";
string str = string.Empty;

PdfReader reader = new PdfReader(fileName);

for (int i = 1; i <= reader.NumberOfPages; i++)
{
    // Use a fresh strategy for every page so text from earlier pages is not carried over.
    ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
    string s = PdfTextExtractor.GetTextFromPage(reader, i, its);

    // GetTextFromPage already returns a .NET string, so no re-encoding is needed;
    // a round trip through Encoding.Default can drop characters outside the ANSI code page.
    str = str + s;
}

reader.Close();

// Update the control once, after all pages have been read.
rtbVsebina.Text = str;

Some PDFs can be read and displayed in the RichTextBox and some cannot. For those that cannot be read, I get an empty RichTextBox containing only a few blank lines, as if I had pressed the Enter key a couple of times.

Does anybody know what could be wrong?

Bruno Lowagie · Accepted Answer

You are confusing page content with page annotations.

Page content is part of the content stream of a page. It's referred to in the /Contents entry of the page dictionary and (optionally) in external objects (aka XObjects). With the code snippet you have copy/pasted in your question, you are extracting this content.
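To see whether a page has a content stream worth parsing at all, a check along the following lines may help. This is only a sketch, assuming iTextSharp 5.x; the class name is made up for illustration and the file path is the one from the question. A page whose content stream is empty, or that only draws image XObjects (a scanned page, for instance), could be one reason why text extraction returns nothing but line breaks.

```csharp
using System;
using iTextSharp.text.pdf;

// Hypothetical helper class, not part of iTextSharp.
class ContentStreamCheck
{
    static void Main()
    {
        PdfReader reader = new PdfReader(@"C:\Users\PC\Desktop\SomePdf.pdf");
        try
        {
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                // Decoded bytes of the stream(s) referenced by /Contents.
                byte[] content = reader.GetPageContent(i);

                // XObjects are listed under /Resources /XObject. Resources may be
                // inherited from a parent node, in which case this lookup on the
                // page dictionary itself returns null and the count shows as 0.
                PdfDictionary page = reader.GetPageN(i);
                PdfDictionary resources = page.GetAsDict(PdfName.RESOURCES);
                PdfDictionary xobjects = resources == null ? null : resources.GetAsDict(PdfName.XOBJECT);
                int xobjectCount = xobjects == null ? 0 : xobjects.Size;

                Console.WriteLine("Page {0}: {1} content bytes, {2} XObject(s)",
                    i, content.Length, xobjectCount);
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```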

A rich text box is one of the many types of annotations. Annotations are not part of the content stream of a page. They are referred to from the /Annots entry of the page dictionary. If you want to get the contents of an annotation, you need to ask the page for its annotations instead of parsing the content of the page. See for instance Reading PDF Annotations with iText.
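As a rough illustration of asking the page for its annotations, the sketch below walks the /Annots array of every page and prints each annotation's subtype and /Contents string. It assumes iTextSharp 5.x; the class name is made up for illustration and the file path is the one from the question. Annotations that keep their text elsewhere (form field values, for example) would need additional handling that is not shown here.

```csharp
using System;
using iTextSharp.text.pdf;

// Hypothetical helper class, not part of iTextSharp.
class AnnotationDump
{
    static void Main()
    {
        PdfReader reader = new PdfReader(@"C:\Users\PC\Desktop\SomePdf.pdf");
        try
        {
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                PdfDictionary page = reader.GetPageN(i);

                // /Annots is optional; pages without annotations have no such entry.
                PdfArray annots = page.GetAsArray(PdfName.ANNOTS);
                if (annots == null) continue;

                for (int j = 0; j < annots.Size; j++)
                {
                    PdfDictionary annot = annots.GetAsDict(j);
                    if (annot == null) continue;

                    PdfName subtype = annot.GetAsName(PdfName.SUBTYPE);
                    PdfString contents = annot.GetAsString(PdfName.CONTENTS);

                    Console.WriteLine("Page {0}, annotation {1} ({2}): {3}",
                        i, j, subtype,
                        contents == null ? "(no /Contents)" : contents.ToUnicodeString());
                }
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```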

In answer to your question "What am I doing wrong": you were looking in the wrong place.
