Remove JavaScript from PDF using iTextSharp

Question

This seems like it should be quick to do, but in practice I keep running into a problem. I have a bunch of PDF forms that include form fields and embedded JavaScript. I would like to remove the JavaScript code safely but leave the PDF form fields intact.

So far I've found lots of solutions, but every one of them has either eliminated both the JavaScript and the form fields or left both intact.

Here's solution A; it copies both the form fields and the JavaScript:

var pdfReader = new PdfReader(infilename);
using (MemoryStream memoryStream = new MemoryStream()) {
    PdfCopyFields copy = new PdfCopyFields(memoryStream);
    copy.AddDocument(pdfReader);
    copy.Close();
    File.WriteAllBytes(rawfilename, memoryStream.ToArray());
}

Alternatively, I have solution B, which strips out both the form fields and the JavaScript (pdfReader is the same reader as in solution A):

Document document = new Document();
using (MemoryStream memoryStream = new MemoryStream()) {
    PdfWriter writer = PdfWriter.GetInstance(document, memoryStream);
    document.Open();
    document.AddDocListener(writer);
    for (int p = 1; p <= pdfReader.NumberOfPages; p++) {
        document.SetPageSize(pdfReader.GetPageSize(p));
        document.NewPage();
        PdfContentByte cb = writer.DirectContent;
        PdfImportedPage pageImport = writer.GetImportedPage(pdfReader, p);
        int rot = pdfReader.GetPageRotation(p);
        if (rot == 90 || rot == 270) {
            cb.AddTemplate(pageImport, 0, -1.0F, 1.0F, 0, 0, pdfReader.GetPageSizeWithRotation(p).Height);
        } else {
            cb.AddTemplate(pageImport, 1.0F, 0, 0, 1.0F, 0, 0);
        }
    }
    document.Close();
    File.WriteAllBytes(rawfile, memoryStream.ToArray());
}

Does anyone know how to modify either solution A or B to eliminate the JavaScript but leave the form fields in place?

EDIT: Solution code is here!

using (MemoryStream memoryStream = new MemoryStream()) {
    PdfStamper stamper = new PdfStamper(pdfReader, memoryStream);
    // Valid object numbers run from 0 to XrefSize - 1.
    for (int i = 0; i < pdfReader.XrefSize; i++) {
        // Only dictionaries (including stream dictionaries) can carry these keys.
        PdfDictionary pd = pdfReader.GetPdfObject(i) as PdfDictionary;
        if (pd != null) {
            pd.Remove(PdfName.AA);
            pd.Remove(PdfName.JS);
            pd.Remove(PdfName.JAVASCRIPT);
        }
    }
    stamper.Close();
    pdfReader.Close();
    File.WriteAllBytes(rawfile, memoryStream.ToArray());
}

mkl · Accepted Answer

To manipulate a single PDF, you should open it with the class PdfStamper and change its contents in place. In your case that means iterating over the existing form fields and removing their JavaScript entries.

The iTextSharp sample AddJavaScriptToForm.cs corresponding to AddJavaScriptToForm.java from chapter 13 of iText in Action — 2nd Edition shows how JavaScript actions are added to fields, the central code being:

PdfStamper stamper = new PdfStamper(reader, ms);

AcroFields form = stamper.AcroFields;
AcroFields.Item fd = form.GetFieldItem("married");

PdfDictionary dictYes = (PdfDictionary) PdfReader.GetPdfObject(fd.GetWidgetRef(0));
PdfDictionary yesAction = ...;
dictYes.Put(PdfName.AA, yesAction);

Thus, to remove such JavaScript form field actions you have to iterate over all those PDF form fields and remove the /AA values in the associated dictionaries:

dictXXX.Remove(PdfName.AA);
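
The field iteration described above could be sketched roughly as follows. This is only a minimal sketch, assuming iTextSharp 5.x (where AcroFields.Fields is a generic dictionary of AcroFields.Item values). The class and method names are hypothetical and not part of the original answer, and the sketch has not been run against real-world forms.

```csharp
using System.Collections.Generic;
using System.IO;
using iTextSharp.text.pdf;

// Hypothetical helper, for illustration only.
public static class FieldScriptRemover
{
    // Returns a copy of the PDF with the /AA entries removed from
    // every form field while the fields themselves are kept.
    public static byte[] StripFieldActions(byte[] sourcePdf)
    {
        PdfReader reader = new PdfReader(sourcePdf);
        using (MemoryStream output = new MemoryStream())
        {
            PdfStamper stamper = new PdfStamper(reader, output);

            foreach (KeyValuePair<string, AcroFields.Item> entry in stamper.AcroFields.Fields)
            {
                AcroFields.Item item = entry.Value;
                for (int i = 0; i < item.Size; i++)
                {
                    // A field can have several widgets; /AA may sit on the
                    // widget annotation or on the field (value) dictionary.
                    item.GetWidget(i).Remove(PdfName.AA);
                    item.GetValue(i).Remove(PdfName.AA);
                    // Keep iText's merged view consistent with the change.
                    item.GetMerged(i).Remove(PdfName.AA);
                }
            }

            stamper.Close();
            reader.Close();
            return output.ToArray();
        }
    }
}
```

A call such as File.WriteAllBytes(rawfile, FieldScriptRemover.StripFieldActions(File.ReadAllBytes(infilename))) would then write the cleaned form. Note that this only covers /AA on form fields. Scripts attached elsewhere in the document are not touched by it.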

EDIT: (provided by Ted Spence) Here is the final code that successfully removes the JavaScript while leaving all form fields intact:

using (MemoryStream memoryStream = new MemoryStream())
{
    PdfStamper stamper = new PdfStamper(pdfReader, memoryStream);
    // Valid object numbers run from 0 to XrefSize - 1.
    for (int i = 0; i < pdfReader.XrefSize; i++)
    {
        PdfDictionary pd = pdfReader.GetPdfObject(i) as PdfDictionary;
        if (pd != null)
        {
            pd.Remove(PdfName.AA); // Removes additional-actions (/AA) entries, i.e. event-triggered actions
            pd.Remove(PdfName.JS); // Removes the script (/JS) entry of JavaScript actions
            pd.Remove(PdfName.JAVASCRIPT); // Removes /JavaScript entries, e.g. the document-level script name tree
        }
    }
    stamper.Close();
    pdfReader.Close();
    File.WriteAllBytes(rawfile, memoryStream.ToArray());
}

EDIT: (by mkl) The solution above is somewhat overachieving because it touches each and every indirect dictionary object. On the other hand, it ignores direct (inline) dictionaries nested inside other objects. (I haven't checked the spec, though; maybe all /AA, /JS, and /JavaScript entries, the latter being PdfName.JAVASCRIPT in iText, appear only in dictionaries that have to be indirect objects, or that are at least dereferenced by this code.)

If this task were my job, I would try to access the objects that can carry JavaScript more specifically.

The overachieving procedure might have one advantage, though: it also inspects PDF objects that are not currently specified as carrying JavaScript but will be in later PDF versions.
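
The more specific access mentioned above could look roughly like the following. Again this is only a sketch, assuming iTextSharp 5.x. The helper name is hypothetical, and the list of locations (catalog, name tree, open action, pages, annotations) is an assumption about where scripts usually live, not an exhaustive reading of the PDF specification.

```csharp
using System.IO;
using iTextSharp.text.pdf;

// Hypothetical helper, for illustration only.
public static class TargetedScriptRemover
{
    public static byte[] StripKnownScriptLocations(byte[] sourcePdf)
    {
        PdfReader reader = new PdfReader(sourcePdf);
        using (MemoryStream output = new MemoryStream())
        {
            PdfStamper stamper = new PdfStamper(reader, output);
            PdfDictionary catalog = reader.Catalog;

            // Document-level scripts: the /JavaScript name tree under /Names.
            PdfDictionary names = catalog.GetAsDict(PdfName.NAMES);
            if (names != null)
                names.Remove(PdfName.JAVASCRIPT);

            // Document-level additional actions (will-close, will-save, ...).
            catalog.Remove(PdfName.AA);

            // /OpenAction may be a destination array or a non-script action;
            // drop it only when it is a JavaScript action dictionary.
            PdfDictionary openAction = catalog.GetAsDict(PdfName.OPENACTION);
            if (openAction != null &&
                PdfName.JAVASCRIPT.Equals(openAction.GetAsName(PdfName.S)))
                catalog.Remove(PdfName.OPENACTION);

            for (int p = 1; p <= reader.NumberOfPages; p++)
            {
                // Page-level additional actions (page open / page close).
                PdfDictionary page = reader.GetPageN(p);
                page.Remove(PdfName.AA);

                // Annotation-level additional actions; form field widgets
                // are annotations too, so they are covered here as well.
                PdfArray annots = page.GetAsArray(PdfName.ANNOTS);
                if (annots == null) continue;
                for (int a = 0; a < annots.Size; a++)
                {
                    PdfDictionary annot = annots.GetAsDict(a);
                    if (annot != null) annot.Remove(PdfName.AA);
                }
            }

            stamper.Close();
            reader.Close();
            return output.ToArray();
        }
    }
}
```

The trade-off is the one described above. This version leaves unrelated objects alone and reaches dictionaries regardless of whether they are direct or indirect, because GetAsDict resolves references. However, it will miss any script stored in a location the sketch does not list, such as /A actions on annotations or /AA on non-terminal field dictionaries.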
