Rok

Reputation: 133

iTextSharp can't read some PDF files

I am having trouble reading the content of some PDFs and displaying it in a RichTextBox. I use the following code:

string fileName = @"C:\Users\PC\Desktop\SomePdf.pdf";
string str = string.Empty;

PdfReader reader = new PdfReader(fileName);

for (int i = 1; i <= reader.NumberOfPages; i++)
{
    ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
    String s = PdfTextExtractor.GetTextFromPage(reader, i, its);

    s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
    str = str + s;
    rtbVsebina.Text = str;
}

reader.Close();

Some PDFs can be read and displayed in the RichTextBox, and some cannot. For those that cannot be read, I only get an empty RichTextBox containing a few blank lines, as if I had pressed Enter on the keyboard a couple of times.

Does anybody know what could be wrong?

Upvotes: 0

Views: 1290

Answers (1)

Bruno Lowagie

Reputation: 77528

You are confusing page content with page annotations.

Page content is part of the content stream of a page. It is referenced by the /Contents entry of the page dictionary and (optionally) by external objects (aka XObjects). The code snippet you copied into your question extracts this content.

A rich text box is one of the many types of annotations. Annotations are not part of the content stream of a page. They are referenced from the /Annots entry of the page dictionary. If you want to get the contents of an annotation, you need to ask the page for its annotations instead of parsing the content of the page. See for instance Reading PDF Annotations with iText.
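As a rough illustration of asking the page for its annotations, here is a minimal sketch assuming iTextSharp 5.x. The class and method names (AnnotationReader, ReadAnnotationContents) are made up for this example, and it only reads the plain /Contents string of each annotation; it is not a complete annotation parser.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;

// Hypothetical helper, not part of iTextSharp.
static class AnnotationReader
{
    // Returns the /Contents text of every annotation in the document.
    public static List<string> ReadAnnotationContents(string path)
    {
        var result = new List<string>();
        PdfReader reader = new PdfReader(path);
        try
        {
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                // The page dictionary holds /Contents (page content)
                // and /Annots (annotations) as separate entries.
                PdfDictionary page = reader.GetPageN(i);
                PdfArray annots = page.GetAsArray(PdfName.ANNOTS);
                if (annots == null)
                    continue; // this page has no annotations

                for (int j = 0; j < annots.Size; j++)
                {
                    // GetAsDict resolves indirect references for us.
                    PdfDictionary annot = annots.GetAsDict(j);
                    if (annot == null)
                        continue;

                    PdfString contents = annot.GetAsString(PdfName.CONTENTS);
                    if (contents != null)
                        result.Add(contents.ToUnicodeString());
                }
            }
        }
        finally
        {
            reader.Close();
        }
        return result;
    }
}
```

If this returns text for the PDFs where PdfTextExtractor yields only blank lines, that would suggest the visible text lives in annotations rather than in the page content stream.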

In answer to your question "What am I doing wrong": you were looking in the wrong place.

Upvotes: 1
