levraininjaneer

Reputation: 1377

PyPDF2 unable to read XFA PDF file after it has been filled in using iTextSharp, even though it opens OK in Acrobat

I am writing an application to deal with some legacy XFA PDF files.

With PyPDF2, I open XFA PDFs and extract the XFA data of human-filled PDFs. However, my application also sometimes "fills in" these XFA forms using C#'s (deprecated) iTextSharp library.

My problem is that PyPDF2 seems unable to properly open the PDF files after I've filled them in using the C# component. The XFA data of a human-filled PDF can be extracted using the code below. However, once the same file has been filled in using my C# component (file here), PyPDF2 can't seem to access its XFA layer, even though the PDF still seems to work fine in Adobe Reader.

(Note: both these files, being XFA PDFs, won't open in all PDF tools; e.g. Google Chrome does not open them. But they should open fine with an Adobe tool.)

Is there maybe something that can be done with PyPDF2 to overcome this? Or should I change something on the iTextSharp side? Is there a way to manipulate the XFA values from Python (thereby eliminating the need for the C# component)?

Here is the Python code, followed by the C# code.

import PyPDF2 as pypdf
import xml.etree.ElementTree as ET


def find_in_dict(needle, haystack):
    """ this is some magic someone on the internet shared for getting
    xfa data out of an xml object. To be honest I don't understand
    100 % how it works, but it works well.

    Args:
        needle: tag you are looking for (for us: '/XFA')
        haystack: the PyPDF resolvedObjects

    Returns:
        XFA data in an xml form

    """

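    for key in haystack.keys():
        try:
            value = haystack[key]
        except Exception:
            # Some entries cannot be resolved; skip them.
            continue
        if key == needle:
            return value
        if isinstance(value, dict):
            # Recurse into nested dictionaries.
            x = find_in_dict(needle, value)
            if x is not None:
                return x
    return None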


def extract_data_from_pdf(file_location: str, field_name_list: list[str]) -> dict:
    """ extracts data from an xfa pdf

    Args:
        file_location:  path to target pdf file as string
        field_name_list:    list of names (as string) of fields inside the xfa pdf

    Returns:
        output:     dict with provided pdf field names as keys and pdf field values as values

    """

    with open(file_location, 'rb') as f:
        pdf = pypdf.PdfFileReader(f)
        xfa = find_in_dict('/XFA', pdf.resolvedObjects)
        xml = xfa[13].getObject().getData()  # I had to use trial and error to find that it's the element at index 13.
                                             # For different files it can be a different element.
                                             # However, the error I get is that xfa is None,
                                             # meaning no /XFA tag was found in the preceding line.
    output = {}
    xml_string = xml.decode("utf-8")
    root = ET.fromstring(xml_string)

    for key in field_name_list:
        try:
            output[key] = root.findall(".//" + key)[0].text
        except IndexError:
            print(f"\n\n ERROR in extract_data_from_pdf! Can't find field '{key}'!")
            exit()

    return output


if __name__ == "__main__":
    field_list = ["field_1", "field_2"]
    good_result = extract_data_from_pdf("test_pdf - initial.pdf", field_list)  # works fine
    print(good_result)
    # bad_result = extract_data_from_pdf("test_pdf - processed.pdf", field_list)  # does not work :(
    # print(bad_result)
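
On the hard-coded index in xfa[13]: the /XFA entry is normally an array that alternates packet names and stream references ('template', stream, 'datasets', stream, ...), so the data packet can be looked up by name instead of by position. A minimal sketch, assuming that layout; find_xfa_packet is a hypothetical helper and not part of PyPDF2:

```python
def find_xfa_packet(xfa_array, packet_name="datasets"):
    """Return the entry that follows `packet_name` in a /XFA array.

    Assumes the array alternates names and stream references:
    [name0, stream0, name1, stream1, ...]. Returns None if not found.
    """
    # Walk the array in (name, stream) pairs.
    for i in range(0, len(xfa_array) - 1, 2):
        if str(xfa_array[i]) == packet_name:
            return xfa_array[i + 1]
    return None
```

In the code above this would replace the index lookup, e.g. xml = find_xfa_packet(xfa).getObject().getData(). It does not help when xfa itself is None, which is the actual error described here.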

Here is the C# code that I use to fill in the PDF, which then seems to break it for PyPDF2:

using System;
using System.IO;
using iTextSharp.text.pdf;
using System.Xml;


namespace xfa_editor
{
    class Program
    {

        static void Main()
        {
            string[] args = Environment.GetCommandLineArgs();
            string[] fields = args[2].Replace("\'", "").Split(',');
            string[] values = args[3].Replace("\'", "").Split(',');

            Console.WriteLine("fields[0] is " + fields[0]);

            byte[] result = EditFieldsXFAImproved(args[1], fields, values);
            File.WriteAllBytes(args[1], result);
            Console.WriteLine("In " + args[1] + " The value of the field '" + args[2] + "' has been set to " + args[3]);

        }


        public static byte[] EditFieldsXFAImproved(string path, string[] xpath, string[] values)
        {
            PdfReader reader = new PdfReader(path);

            using (MemoryStream ms = new MemoryStream())
            {
                PdfStamper stamper = new PdfStamper(reader, ms);
                AcroFields form = stamper.AcroFields;
                XfaForm xfa = form.Xfa;
                XmlNode a = xfa.DatasetsNode;

                for (int i = 0; i < xpath.Length; i++)
                {
                    XmlNodeList hits = a.SelectNodes("//" + xpath[i]);
                    foreach (XmlNode hit in hits)
                    {
                        if (hit.NodeType == XmlNodeType.Element)
                        {
                            hit.InnerText = values[i];
                        }

                    }
                }
                xfa.Changed = true;
                stamper.Close();
                reader.Close();

                return ms.ToArray();
            }
        }
    }
}
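
For comparison, the XML edit that the C# loop performs (find every element with a given name, overwrite its text) can be sketched in Python with ElementTree. This only covers the in-memory XML; writing the result back into the PDF is not shown. set_field_values is a hypothetical name, and the sketch assumes the field elements carry no XML namespace, as the extraction code above also assumes:

```python
import xml.etree.ElementTree as ET


def set_field_values(datasets_xml: bytes, new_values: dict) -> bytes:
    """Set the text of every element whose tag matches a key in new_values."""
    root = ET.fromstring(datasets_xml)
    for name, value in new_values.items():
        # ".//name" matches descendants of the root at any depth,
        # similar to the "//name" XPath used in the C# code.
        for element in root.findall(".//" + name):
            element.text = value
    return ET.tostring(root, encoding="utf-8")
```

Unlike InnerText in C#, assigning element.text does not remove child elements, so this is only equivalent for leaf fields.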

Upvotes: 0

Views: 1613

Answers (1)

levraininjaneer

Reputation: 1377

I found the answer here.

Turns out I could fix the C# code by changing

PdfStamper stamper = new PdfStamper(reader, ms);

to

PdfStamper stamper = new PdfStamper(reader, ms, '\0', true);
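
The fourth constructor argument (true) puts PdfStamper in append mode, so the changes are written as an incremental update appended to the original bytes instead of a full rewrite of the file; '\0' leaves the PDF version unchanged. One rough way to check from Python that a processed file was written this way is to compare it with a copy of the original (the C# program overwrites the file in place, so a copy has to be kept beforehand). A minimal sketch under that assumption; looks_like_incremental_update is a hypothetical helper:

```python
def looks_like_incremental_update(original: bytes, updated: bytes) -> bool:
    """Return True if `updated` is `original` with extra bytes appended."""
    return len(updated) > len(original) and updated.startswith(original)
```

Usage would be along the lines of reading both files with open(path, 'rb').read() and passing the two byte strings in.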

Upvotes: 1
