Reputation: 91
I highlighted words in a PDF using the code from the answer to the following question: Highlight words in a pdf using itextsharp, not displaying highlighted word in browser
Now I want to know how to remove those highlighted rectangles using iTextSharp.
private void RemovehighlightPDFAnnotation(string outputFile, string highLightFile, int pageno, string highLightedText)
{
    PdfReader reader = new PdfReader(outputFile);
    using (FileStream fs = new FileStream(highLightFile, FileMode.Create, FileAccess.Write, FileShare.None))
    {
        using (PdfStamper stamper = new PdfStamper(reader, fs))
        {
            PdfDictionary pageDict = reader.GetPageN(pageno);
            PdfArray annots = pageDict.GetAsArray(PdfName.ANNOTS);
            if (annots != null)
            {
                for (int i = 0; i < annots.Size; ++i)
                {
                    PdfDictionary annotationDic = (PdfDictionary)PdfReader.GetPdfObject(annots[i]);
                    PdfName subType = (PdfName)annotationDic.Get(PdfName.SUBTYPE);
                    if (subType.Equals(PdfName.HIGHLIGHT))
                    {
                        PdfString str = annots.GetAsString(i);
                        if (str == highLightedText)
                        {
                            annots.Remove(i);
                        }
                    }
                }
            }
        }
    }
}
It removes all annotations, but I want to remove only a particular one. Suppose I highlighted "united states" and "Patent Application Publication" on page 1; now I want to remove the "united states" highlight alone. I will pass the text "united states".
I referred to this answer. According to it, to get the highlighted text you need to take the coordinates stored in the Highlight annotation (in the QuadPoints array) and use them to parse the text present in the page content at those coordinates.
Upvotes: 0
Views: 1281
Reputation: 96064
As the OP clarified, he actually wants to "get the highlighted annotation coordinates" in order to extract the text from that area, check whether it matches the phrase in question, and (if it does) remove the annotation.
As the code in question always marks only a single rectangle with each annotation, and chooses that rectangle so that it contains only the text in question, he can simply use the annotation rectangle
annotationDic.GetAsArray(PdfName.RECT)
In a more generic case (e.g. for highlight annotations starting at the end of one line and ending at the start of the next), he'd need to check the quad points
annotationDic.GetAsArray(PdfName.QUADPOINTS)
which describe a set of quadrilaterals.
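To make the quad points easier to work with, each quadrilateral can be reduced to an axis-aligned bounding box that is then used as the area to inspect. The following is a minimal sketch under stated assumptions, not part of iTextSharp: the class and method names are made up for illustration, and the input is assumed to be the QuadPoints numbers already copied into a float array (with iTextSharp that could be done via GetAsNumber(i).FloatValue on the array).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not an iTextSharp class.
public static class QuadPointsSketch
{
    // Returns one { llx, lly, urx, ury } box per quadrilateral.
    // Each quadrilateral is stored as 8 numbers: x1 y1 x2 y2 x3 y3 x4 y4.
    public static List<float[]> ToBoundingBoxes(float[] quadPoints)
    {
        if (quadPoints == null || quadPoints.Length % 8 != 0)
            throw new ArgumentException("Expected a multiple of 8 numbers.", nameof(quadPoints));

        var boxes = new List<float[]>();
        for (int q = 0; q < quadPoints.Length; q += 8)
        {
            float minX = float.MaxValue, minY = float.MaxValue;
            float maxX = float.MinValue, maxY = float.MinValue;

            // Walk the four corners of this quadrilateral.
            for (int k = 0; k < 8; k += 2)
            {
                float x = quadPoints[q + k];
                float y = quadPoints[q + k + 1];
                minX = Math.Min(minX, x);
                minY = Math.Min(minY, y);
                maxX = Math.Max(maxX, x);
                maxY = Math.Max(maxY, y);
            }
            boxes.Add(new[] { minX, minY, maxX, maxY });
        }
        return boxes;
    }
}
```

A highlight that wraps over two lines would yield two boxes here, one per line, and the text in both would have to be considered together.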
For example, in the case of the sample from the referenced question (highlighting the occurrences of the word "support" on the third document page of the OP's sample PDF), the method
private void ReportHighlightPDFAnnotation(string highLightFile, int pageno)
{
    PdfReader reader = new PdfReader(highLightFile);
    PdfDictionary pageDict = reader.GetPageN(pageno);
    PdfArray annots = pageDict.GetAsArray(PdfName.ANNOTS);
    if (annots != null)
    {
        for (int i = 0; i < annots.Size; ++i)
        {
            PdfDictionary annotationDic = (PdfDictionary)PdfReader.GetPdfObject(annots[i]);
            PdfName subType = (PdfName)annotationDic.Get(PdfName.SUBTYPE);
            if (subType.Equals(PdfName.HIGHLIGHT))
            {
                Console.Write("HighLight at {0} with {1}\n", annotationDic.GetAsArray(PdfName.RECT), annotationDic.GetAsArray(PdfName.QUADPOINTS));
            }
        }
    }
}
reports
HighLight at [224.65, 654.03, 251.08, 662.03] with [221.65, 654.03, 251.08, 654.03, 221.65, 663.03, 251.08, 663.03]
HighLight at [80.9, 574.13, 107.28, 582.13] with [77.9, 574.13, 107.28, 574.13, 77.9, 583.13, 107.28, 583.13]
HighLight at [209.3, 544.33, 235.67, 552.33] with [206.3, 544.33, 235.67, 544.33, 206.3, 553.33, 235.67, 553.33]
In particular, those values are not null, contrary to what the OP claims in his comment:
"null value only i get for PdfArray annots = pageDict.GetAsArray(PdfName.QUADPOINTS) and annotationDic.GetAsArray(PdfName.RECT)"
If I were the OP, I'd add private data containing the highlighted phrase to the annotations I create. When he wants to remove the annotations for a given phrase, he can simply check that private data.
Text extraction, even from a limited area, is a very costly operation as the page content stream and a possible multitude of form xobject streams have to be parsed.
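The private-data idea could look roughly like the sketch below, assuming iTextSharp 5.x. The key name HighlightedPhrase and both helper methods are hypothetical (neither the PDF specification nor iTextSharp defines them); the only library calls used are PdfDictionary.Put, GetAsName, GetAsString and PdfString.ToUnicodeString.

```csharp
using System;
using iTextSharp.text.pdf;

// Hypothetical helpers for illustration only.
public static class HighlightTagSketch
{
    // Assumed private key; any name that does not clash with standard entries would do.
    public static readonly PdfName PhraseKey = new PdfName("HighlightedPhrase");

    // Call this on the annotation when it is created, before adding it to the page.
    public static void TagWithPhrase(PdfDictionary annotation, string phrase)
    {
        annotation.Put(PhraseKey, new PdfString(phrase, PdfObject.TEXT_UNICODE));
    }

    // True only for highlight annotations whose stored phrase equals the given one.
    public static bool IsHighlightFor(PdfDictionary annotation, string phrase)
    {
        if (annotation == null) return false;
        if (!PdfName.HIGHLIGHT.Equals(annotation.GetAsName(PdfName.SUBTYPE))) return false;

        PdfString stored = annotation.GetAsString(PhraseKey);
        return stored != null
            && string.Equals(stored.ToUnicodeString(), phrase, StringComparison.Ordinal);
    }
}
```

Since PdfAnnotation derives from PdfDictionary, TagWithPhrase can be applied directly to the annotation object created in the highlighting code. The comparison is ordinal and case-sensitive; whether that is the right notion of "same phrase" depends on how the phrases are passed in.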
The OP wants to remove the annotations in this loop:
for (int i = 0; i < annots.Size; ++i)
{
    PdfDictionary annotationDic = (PdfDictionary)PdfReader.GetPdfObject(annots[i]);
    PdfName subType = (PdfName)annotationDic.Get(PdfName.SUBTYPE);
    if (subType.Equals(PdfName.HIGHLIGHT))
    {
        PdfString str = annots.GetAsString(i);
        annots.Remove(i);
    }
}
The problem: if he is at index i and removes this annotation, the former (i+1)st annotation becomes the i-th one. As the next annotation to check, though, is the one now at index i+1, that former (i+1)st annotation will not be checked or removed.
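One common way around this index shift is to walk the array from the end, so that removing an entry only moves entries that have already been checked. The following is a minimal sketch assuming iTextSharp 5.x; the class, the method and the shouldRemove predicate are made up for illustration, and the decision about which highlight to remove is left to the caller.

```csharp
using System;
using iTextSharp.text.pdf;

// Hypothetical helper for illustration only.
public static class AnnotationRemovalSketch
{
    // Removes every highlight annotation for which shouldRemove returns true
    // and returns how many entries were removed.
    public static int RemoveHighlights(PdfArray annots, Func<PdfDictionary, bool> shouldRemove)
    {
        if (annots == null) return 0;

        int removed = 0;
        // Iterate backwards: removing index i never changes the indices below i.
        for (int i = annots.Size - 1; i >= 0; --i)
        {
            PdfDictionary annotation = annots.GetAsDict(i);
            if (annotation == null) continue;
            if (!PdfName.HIGHLIGHT.Equals(annotation.GetAsName(PdfName.SUBTYPE))) continue;

            if (shouldRemove(annotation))
            {
                annots.Remove(i);
                ++removed;
            }
        }
        return removed;
    }
}
```

Inside the PdfStamper block of the question, this would be called with pageDict.GetAsArray(PdfName.ANNOTS) and a predicate that decides whether a given highlight belongs to the phrase to be removed. An alternative that keeps the forward loop is to increment i only when nothing was removed.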
Upvotes: 1