Remove highlighted area in pdf using iTextSharp

Question

I highlighted words in a PDF using the code from the answer to the following question: Highlight words in a pdf using itextsharp, not displaying highlighted word in browser

Now I want to know how to remove those highlighted rectangles using iTextSharp.

private void RemovehighlightPDFAnnotation(string outputFile, string highLightFile, int pageno, string highLightedText)
{
    PdfReader reader = new PdfReader(outputFile);
    using (FileStream fs = new FileStream(highLightFile, FileMode.Create, FileAccess.Write, FileShare.None))
    {
        using (PdfStamper stamper = new PdfStamper(reader, fs))
        {                
            PdfDictionary pageDict = reader.GetPageN(pageno);                
            PdfArray annots = pageDict.GetAsArray(PdfName.ANNOTS);                
            if (annots != null)
            {
                for (int i = 0; i < annots.Size; ++i)                   
                {
                    PdfDictionary annotationDic = (PdfDictionary)PdfReader.GetPdfObject(annots[i]);
                    PdfName subType = (PdfName)annotationDic.Get(PdfName.SUBTYPE);                                               
                    if (subType.Equals(PdfName.HIGHLIGHT))
                    {
                        PdfString str  = annots.GetAsString(i);
                        if(str==highLightedText)
                        {
                                annots.Remove(i); 
                        }                          

                    }
                }                  

            }
        }
    }
}

It removes all annotations, but I want to remove one particular annotation. Suppose I highlighted "united states" and "Patent Application Publication" on page 1; now I want to remove only the "united states" highlight. I will pass the text "united states".

I referred to this answer. According to it, to get the highlighted text you need to get the coordinates stored in the Highlight annotation (in the QuadPoints array) and use these coordinates to parse the text that is present in the page content at those coordinates.

mkl · Accepted Answer

Getting the highlighted annotation coordinates

As the OP clarified, he actually wants to

get the highlighted annotation coordinates

to extract the text from that area, check whether it matches the phrase in question, and (if it does) remove the annotation.

As the code in question always marks only a single rectangle per annotation and chooses that rectangle so that it contains only the text in question, he can simply use the annotation rectangle

annotationDic.GetAsArray(PdfName.RECT)

In a more generic case (e.g. for highlight annotations starting at the end of one line and ending at the start of the next), he'd need to check the quad points

annotationDic.GetAsArray(PdfName.QUADPOINTS)

which describe a set of quadrilaterals.

E.g. in case of the sample from the referenced question (highlighting the occurrence of the word "support" on the third document page of the OP's sample PDF), the method

private void ReportHighlightPDFAnnotation(string highLightFile, int pageno)
{
    PdfReader reader = new PdfReader(highLightFile);
    PdfDictionary pageDict = reader.GetPageN(pageno);
    PdfArray annots = pageDict.GetAsArray(PdfName.ANNOTS);
    if (annots != null)
    {
        for (int i = 0; i < annots.Size; ++i)
        {
            PdfDictionary annotationDic = (PdfDictionary)PdfReader.GetPdfObject(annots[i]);
            PdfName subType = (PdfName)annotationDic.Get(PdfName.SUBTYPE);
            if (subType.Equals(PdfName.HIGHLIGHT))
            {
                Console.Write("HighLight at {0} with {1}\n", annotationDic.GetAsArray(PdfName.RECT), annotationDic.GetAsArray(PdfName.QUADPOINTS));
            }
        }
    }
}

reports

HighLight at [224.65, 654.03, 251.08, 662.03] with [221.65, 654.03, 251.08, 654.03, 221.65, 663.03, 251.08, 663.03]
HighLight at [80.9, 574.13, 107.28, 582.13] with [77.9, 574.13, 107.28, 574.13, 77.9, 583.13, 107.28, 583.13]
HighLight at [209.3, 544.33, 235.67, 552.33] with [206.3, 544.33, 235.67, 544.33, 206.3, 553.33, 235.67, 553.33]

In particular those values are not null as the OP claims in his comment

null value only i get for PdfArray annots = pageDict.GetAsArray(PdfName.QUADPOINTS) and annotationDic.GetAsArray(PdfName.RECT)
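If the quad points are to drive a region-based text extraction, each group of eight numbers (four x/y corner pairs) can be reduced to an axis-aligned bounding box first. The following is only a sketch in plain C#, assuming the numbers have already been copied out of the QuadPoints array into a float[]; the names QuadPointsDemo, Box and ToBoundingBoxes are made up for illustration and are not part of iTextSharp.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class QuadPointsDemo
{
    // Axis-aligned box in PDF user-space units (hypothetical helper type).
    public struct Box
    {
        public float Llx, Lly, Urx, Ury;
    }

    // Reduces each quadrilateral (8 numbers: x1 y1 x2 y2 x3 y3 x4 y4)
    // to its bounding box, without assuming any particular corner order.
    public static List<Box> ToBoundingBoxes(float[] quadPoints)
    {
        if (quadPoints == null || quadPoints.Length % 8 != 0)
            throw new ArgumentException("Expected a multiple of 8 numbers.");

        var boxes = new List<Box>();
        for (int q = 0; q < quadPoints.Length; q += 8)
        {
            float minX = float.MaxValue, minY = float.MaxValue;
            float maxX = float.MinValue, maxY = float.MinValue;
            for (int k = 0; k < 8; k += 2)
            {
                float x = quadPoints[q + k];
                float y = quadPoints[q + k + 1];
                minX = Math.Min(minX, x); maxX = Math.Max(maxX, x);
                minY = Math.Min(minY, y); maxY = Math.Max(maxY, y);
            }
            boxes.Add(new Box { Llx = minX, Lly = minY, Urx = maxX, Ury = maxY });
        }
        return boxes;
    }

    public static void Main()
    {
        // First quad reported for the sample document above.
        float[] quad = { 221.65f, 654.03f, 251.08f, 654.03f,
                         221.65f, 663.03f, 251.08f, 663.03f };
        foreach (Box b in ToBoundingBoxes(quad))
        {
            Console.WriteLine(string.Format(CultureInfo.InvariantCulture,
                "[{0}, {1}, {2}, {3}]", b.Llx, b.Lly, b.Urx, b.Ury));
        }
    }
}
```

A highlight spanning a line break would yield one box per quadrilateral, so the text of all boxes would have to be extracted and concatenated before comparing it with the phrase.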

An alternative approach

If I were the OP, I'd add private data containing the highlighted phrase to the annotations I create. When he wants to remove the annotations for a given phrase, he can simply check that private data.

Text extraction, even from a limited area, is a very costly operation as the page content stream and a possible multitude of form XObject streams have to be parsed.
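A rough sketch of how such private data might look with iTextSharp 5.x: the phrase is stored under a custom key when the annotation is created and compared when deciding what to remove. The key name HighlightedPhrase and the helper names are assumptions chosen for illustration; they are not defined by iTextSharp or by the PDF specification.

```csharp
using System;
using iTextSharp.text.pdf;

public static class HighlightTagging
{
    // Hypothetical private key; any name not otherwise used in annotation dictionaries would do.
    public static readonly PdfName PhraseKey = new PdfName("HighlightedPhrase");

    // Call this on the highlight annotation before adding it to the page.
    // PdfAnnotation derives from PdfDictionary, so it can be passed in directly.
    public static void Tag(PdfDictionary annotation, string phrase)
    {
        annotation.Put(PhraseKey, new PdfString(phrase, PdfObject.TEXT_UNICODE));
    }

    // True only for Highlight annotations carrying exactly the given phrase
    // (compared case-insensitively here, which is a design choice of this sketch).
    public static bool IsHighlightFor(PdfDictionary annotation, string phrase)
    {
        if (annotation == null) return false;
        if (!PdfName.HIGHLIGHT.Equals(annotation.GetAsName(PdfName.SUBTYPE))) return false;

        PdfString stored = annotation.GetAsString(PhraseKey);
        return stored != null
            && string.Equals(stored.ToUnicodeString(), phrase, StringComparison.OrdinalIgnoreCase);
    }
}
```

In the removal loop, the test `IsHighlightFor(annotationDic, highLightedText)` would then take the place of the subtype check, so no page content needs to be parsed at all. Annotations created before the tagging was introduced carry no such entry and would simply not match.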

A warning on loop design

The OP wants to remove the annotations in this loop:

for (int i = 0; i < annots.Size; ++i)                   
{
    PdfDictionary annotationDic = (PdfDictionary)PdfReader.GetPdfObject(annots[i]);
    PdfName subType = (PdfName)annotationDic.Get(PdfName.SUBTYPE);                                               
    if (subType.Equals(PdfName.HIGHLIGHT))
    {
        PdfString str  = annots.GetAsString(i);
        annots.Remove(i);                           
    }
}

The problem: If he is at index i and removes this annotation, the former (i+1)-th annotation becomes the i-th one. As the next annotation to check, though, is the now (i+1)-th, that former (i+1)-th annotation will not be checked or removed.
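One common way around this is to walk the collection from the last index down to the first, so that a removal only shifts elements that have already been visited. The sketch below shows the pattern on an ordinary list; RemoveMatching is a made-up helper name, and the strings merely stand in for annotation subtypes.

```csharp
using System;
using System.Collections.Generic;

public static class ReverseRemoval
{
    // Removes every matching item while iterating backwards,
    // so no element is skipped after a removal. Returns the number removed.
    public static int RemoveMatching<T>(IList<T> items, Predicate<T> shouldRemove)
    {
        int removed = 0;
        for (int i = items.Count - 1; i >= 0; --i)
        {
            if (shouldRemove(items[i]))
            {
                items.RemoveAt(i);
                ++removed;
            }
        }
        return removed;
    }

    public static void Main()
    {
        // Two adjacent matches: a forward loop with ++i would skip the second one.
        var subtypes = new List<string> { "Highlight", "Highlight", "Link", "Highlight" };
        int removed = RemoveMatching(subtypes, s => s == "Highlight");
        Console.WriteLine("{0} removed, {1} left", removed, subtypes.Count); // prints "3 removed, 1 left"
    }
}
```

Applied to the OP's code, the loop header would become `for (int i = annots.Size - 1; i >= 0; --i)` with the body unchanged. Alternatively, a forward loop can be kept if i is only incremented when nothing was removed.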
