Reputation: 59
I need to extract and read only the annotations of a PDF using C#.
I can extract the file's content without any problem using both PDFBox and iTextSharp, but I need to read the annotation text, or the underlined or coloured (highlighted) lines.
Any idea?
Upvotes: 1
Views: 1382
Reputation: 77606
You need to understand that there is a difference between the actual content of a page (the content that is described using PDF syntax in the content stream of a page) and the annotations that are added to a page (the content that is described in the annotation dictionaries in the /Annots entry of the page dictionary).
So far, you are extracting the content of the annotation dictionaries, but you also want to extract the content from the content stream at the location identified by the /Rect entry of the annotation. You need to parse the content stream of the page to do that.
Please go to the official iText web site and read the FAQ, more specifically: How to read text from a specific position?
Suppose that reader is your PdfReader instance, rect is the Rectangle defining the location of the text you want to extract, and page is the corresponding page number. You can then create a RenderFilter and use the LocationTextExtractionStrategy like this:
RenderFilter[] filter = {new RegionTextRenderFilter(rect)};
ITextExtractionStrategy strategy =
    new FilteredTextRenderListener(
        new LocationTextExtractionStrategy(), filter);
string text = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
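To connect this back to the annotations themselves, here is a minimal sketch, assuming iTextSharp 5.x, of how the /Annots array of a page could be walked so that the /Rect of each highlight or underline annotation is fed into the region filter described above. The class name AnnotationTextSketch and the method name ExtractMarkedText are hypothetical, chosen only for illustration; treat this as a starting point, not a definitive implementation.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper, not part of iTextSharp.
static class AnnotationTextSketch
{
    // Returns the page text found inside the /Rect of each
    // highlight or underline annotation on the given page (1-based).
    public static List<string> ExtractMarkedText(PdfReader reader, int page)
    {
        var results = new List<string>();

        PdfDictionary pageDict = reader.GetPageN(page);
        PdfArray annots = pageDict.GetAsArray(PdfName.ANNOTS);
        if (annots == null) return results; // page has no annotations

        for (int i = 0; i < annots.Size; i++)
        {
            PdfDictionary annot = annots.GetAsDict(i);
            if (annot == null) continue;

            // Only text markup annotations are of interest here.
            PdfName subtype = annot.GetAsName(PdfName.SUBTYPE);
            if (!PdfName.HIGHLIGHT.Equals(subtype) &&
                !PdfName.UNDERLINE.Equals(subtype)) continue;

            PdfArray r = annot.GetAsArray(PdfName.RECT);
            if (r == null || r.Size < 4) continue;

            float x1 = r.GetAsNumber(0).FloatValue;
            float y1 = r.GetAsNumber(1).FloatValue;
            float x2 = r.GetAsNumber(2).FloatValue;
            float y2 = r.GetAsNumber(3).FloatValue;

            // /Rect corners are not guaranteed to be ordered, so normalize them.
            var rect = new Rectangle(
                Math.Min(x1, x2), Math.Min(y1, y2),
                Math.Max(x1, x2), Math.Max(y1, y2));

            // Same filter and strategy as in the snippet above,
            // applied once per annotation rectangle.
            RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
            ITextExtractionStrategy strategy =
                new FilteredTextRenderListener(
                    new LocationTextExtractionStrategy(), filter);

            results.Add(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
        }
        return results;
    }
}
```

Note that /Rect is only the bounding box of the annotation. For a highlight that spans several lines, the box may cover more text than was actually marked; the /QuadPoints entry of the annotation describes the marked regions more precisely, at the cost of extra work to turn each quadrilateral into a rectangle.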
Upvotes: 1