Remove object in PDF with iTextSharp and save

Question

This is a case of OCR gone wrong. I need to remove the hidden text from a PDF and I'm having a hard time figuring out how to do it.

The hidden text resides in an object whose name always starts with /QuickPDF (/QuickPDFsomething), which sits under an /XObject dictionary in the page's /Resources dictionary.

I have tried these two things and neither has worked, so I'm clearly doing something wrong.

Option 1 - Kill obj - The PDF won't open cleanly in Acrobat, which states 'An error exists on this page. Acrobat may not display the page correctly', but the page looks OK. PitStop fails with 'Critical parser failure: XObject resource missing'.

PdfReader.KillIndirect(obj);
oPdfFile.GetPdfReader().RemoveUnusedObjects();
var stamper = new PdfStamper(oPdfFile.GetPdfReader(), new FileStream(@"C:\temp.pdf", FileMode.Create));
stamper.Close();

Option 2 - CleanupProcessor - Throws an exception about 'A Graphics object cannot be created from an image that has an indexed pixel format'.

var stamper = new PdfStamper(oPdfFile.GetPdfReader(), new FileStream(@"C:\temp.pdf", FileMode.Create));
var cleanupLocations = new List<PdfCleanUpLocation>();
var pageRect = oPdfFile.GetPdfReader().GetCropBox(1);
cleanupLocations.Add(new PdfCleanUpLocation(1, pageRect));
PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanupLocations, stamper);
cleaner.CleanUp();
stamper.Close();

I'd like to remove the /QuickPDF object (41 0 R in my file) and also remove the /QuickPDF Do call from the content stream that invokes it.

Unfortunately I cannot provide the PDF.

Any tips on how to do this?

Darren · Accepted Answer

I hate to answer my own question but I wanted to share the solution I found in case others need it.

After playing around with this for a couple of days, I figured out that Option 1 above does indeed remove the object, and that the error I was getting from PitStop occurred because the content stream still had a reference to the /QuickPDF XObject.

So I tried following @mkl's solution in "Removing Watermark from PDF iTextSharp", but it kept putting unwanted data in the content stream that rotated my PDF.

Then I found @Chris's solution in "Removing Watermark from a PDF using iTextSharp", and it seems to work, although I'm not sure how stable it will be.

This is my solution for removing /QuickPDF from the content stream:

using System.Text;
using iTextSharp.text.pdf;

int numPages = oPdfFile.GetPdfReader().NumberOfPages;
int pgNumber = 1; // only page 1 here; loop pgNumber from 1 to numPages to clean every page

PdfDictionary page = oPdfFile.GetPdfReader().GetPageN(pgNumber);
// GetAsArray returns null when /Contents is a single stream rather than an array
PdfArray contentarray = page.GetAsArray(PdfName.CONTENTS);
PRStream stream;
string content;
if (contentarray != null)
{
    // Loop through each content stream of the page
    for (int j = 0; j < contentarray.Size; j++)
    {
        stream = (PRStream)contentarray.GetAsStream(j);
        // Note: ASCII decoding turns bytes above 127 into '?', so this assumes a plain-text stream
        content = Encoding.ASCII.GetString(PdfReader.GetStreamBytes(stream));
        string[] tokens = content.Split('\n');
        for (int i = 0; i < tokens.Length; i++)
        {
            // Blank every line that mentions the /QuickPDF XObject
            if (tokens[i].Contains("/QuickPDF"))
            {
                tokens[i] = string.Empty;
            }
        }

        string outstr = string.Join("\n", tokens);
        byte[] outbytes = Encoding.ASCII.GetBytes(outstr);
        stream.SetData(outbytes);
    }
}
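
One caveat with the loop above: it blanks every line that contains /QuickPDF, so any other operators that happen to share that line go with it. A narrower variant would remove only the "/QuickPDFsomething Do" invocation itself. The following is a minimal sketch of that idea, not something I have run against real files: ContentStreamCleaner and RemoveXObjectCalls are hypothetical names, it uses only System.Text.RegularExpressions, and it assumes the name never appears inside a string literal or inline image data in the stream.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of iTextSharp).
public static class ContentStreamCleaner
{
    // Removes invocations such as "/QuickPDFabc123 Do": a name starting with
    // namePrefix, then whitespace, then the Do operator as a whole token.
    public static string RemoveXObjectCalls(string content, string namePrefix)
    {
        // After the prefix, consume name characters up to whitespace or a PDF delimiter.
        string pattern = Regex.Escape(namePrefix) + @"[^\s/\[\]()<>{}%]*\s+Do(?=\s|$)";
        return Regex.Replace(content, pattern, string.Empty);
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up content stream text, for illustration only.
        string content = "q 1 0 0 1 0 0 cm /Im0 Do Q\nq /QuickPDFabc123 Do Q\n";
        Console.WriteLine(ContentStreamCleaner.RemoveXObjectCalls(content, "/QuickPDF"));
    }
}
```

In the loop above, this would take the place of the Split / blank / Join steps: decode the stream into content, pass it through RemoveXObjectCalls(content, "/QuickPDF"), then encode the result and hand it to stream.SetData. An empty q ... Q pair left behind is harmless.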
