Reputation: 133
I am having trouble reading the content of some PDFs and displaying it in a RichTextBox.
I use the following code:
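// Requires: using System.Text; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
string fileName = @"C:\Users\PC\Desktop\SomePdf.pdf";
string str = string.Empty;
PdfReader reader = new PdfReader(fileName);
for (int i = 1; i <= reader.NumberOfPages; i++)
{
    // Use a fresh strategy per page so text from earlier pages is not carried over.
    ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
    String s = PdfTextExtractor.GetTextFromPage(reader, i, its);
    // Round-trip through the system default code page (can lose characters that code page cannot represent).
    s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
    str = str + s;
    rtbVsebina.Text = str;
}
reader.Close();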
Some PDFs can be read and displayed in the RichTextBox, and some cannot. For those that cannot be read, the RichTextBox stays empty except for a few blank lines, as if I had pressed the Enter key a couple of times.
Does anybody know what could be wrong?
Upvotes: 0
Views: 1290
Reputation: 77528
You are confusing page content with page annotations.
Page content is part of the content stream of a page. It is referred to in the /Contents entry of the page dictionary and (optionally) in external objects (aka XObjects). With the code snippet you have copy/pasted in your question, you are extracting this content.
A rich text box is one of the many types of annotations. Annotations are not part of the content stream of a page. They are referred to from the /Annots entry of the page dictionary. If you want to get the contents of an annotation, you need to ask the page for its annotations instead of parsing the content of the page. See for instance Reading PDF Annotations with iText.
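As a rough illustration of asking a page for its annotations, here is a minimal sketch that assumes iTextSharp 5.x. The class name AnnotationReader and the method ReadAnnotationContents are hypothetical names chosen for this example. It reads only the plain-text /Contents entry of each annotation, and it is not necessarily the approach taken in the linked answer.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;

// Hypothetical helper, not part of iTextSharp.
static class AnnotationReader
{
    // Returns the /Contents string of every annotation that has one.
    public static List<string> ReadAnnotationContents(string path)
    {
        var result = new List<string>();
        PdfReader reader = new PdfReader(path);
        try
        {
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                // The page dictionary holds /Annots next to /Contents.
                PdfDictionary page = reader.GetPageN(i);
                PdfArray annots = page.GetAsArray(PdfName.ANNOTS);
                if (annots == null) continue; // page has no annotations

                for (int j = 0; j < annots.Size; j++)
                {
                    // GetAsDict resolves indirect references for us.
                    PdfDictionary annot = annots.GetAsDict(j);
                    if (annot == null) continue;

                    PdfString contents = annot.GetAsString(PdfName.CONTENTS);
                    if (contents != null)
                        result.Add(contents.ToUnicodeString());
                }
            }
        }
        finally
        {
            reader.Close();
        }
        return result;
    }
}
```

Assuming a form with the rtbVsebina control from the question, the result could be shown with something like rtbVsebina.Text = string.Join(Environment.NewLine, AnnotationReader.ReadAnnotationContents(fileName)); if the list comes back empty, the document simply has no annotations carrying a /Contents entry.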
In answer to your question "What am I doing wrong": you were looking in the wrong place.
Upvotes: 1