I am writing an application to deal with some legacy XFA PDF files.
With PyPDF2, I open XFA PDFs and extract the XFA data of human-filled PDFs. However, my application also sometimes "fills in" these XFA forms using C#'s (deprecated) iTextSharp library.
My problem is that PyPDF2 seems unable to properly open the PDF files after I've filled them using the C# component. A human-filled PDF can be extracted using the code below. However, once the same file has been filled in using my C# component (file here), PyPDF2 can't seem to access its XFA layer, even though the PDF still seems to work fine in Adobe's Reader.
(Note: both these files, being XFA PDFs, won't open in all PDF tools. For example, Google Chrome does not open them. But they should open fine with an Adobe tool.)
Is there maybe something that can be done with PyPDF2 to overcome this? Or should I change something on the iTextSharp side? Is there a way to manipulate the XFA values from Python (thereby eliminating the need for the C# component)?
Here is the Python code, followed by the C# code.
import PyPDF2 as pypdf
import xml.etree.ElementTree as ET

def find_in_dict(needle, haystack):
    """ this is some magic someone on the internet shared for getting
    xfa data out of the PDF's objects. To be honest I don't understand
    100 % how it works, but it works well.
    Args:
        needle: tag you are looking for (for us: '/XFA')
        haystack: the PyPDF resolvedObjects
    Returns:
        the value stored under the needle key, or None if it is not found
    """
    for key in haystack.keys():
        try:
            value = haystack[key]
        except Exception:
            # skip entries that cannot be resolved
            continue
        if key == needle:
            return value
        if isinstance(value, dict):
            # search nested dictionaries recursively
            x = find_in_dict(needle, value)
            if x is not None:
                return x

def extract_data_from_pdf(file_location: str, field_name_list: list[str]) -> dict:
    """ extracts data from an xfa pdf
    Args:
        file_location: path to target pdf file as string
        field_name_list: list of names (as string) of fields inside the xfa pdf
    Returns:
        output: dict with provided pdf field names as keys and pdf field values as values
    """
    with open(file_location, 'rb') as f:
        pdf = pypdf.PdfFileReader(f)
        xfa = find_in_dict('/XFA', pdf.resolvedObjects)
        xml = xfa[13].getObject().getData()  # I had to use trial and error to see it's the element at index 13.
        # For different files it can be a different element.
        # However, the error I get is that xfa is None,
        # meaning no /XFA tag was found in the preceding line
    output = {}
    xml_string = xml.decode("utf-8")
    root = ET.fromstring(xml_string)
    for key in field_name_list:
        try:
            output[key] = root.findall(".//" + key)[0].text
        except IndexError:
            print(f"\n\n ERROR in extract_data_from_pdf! Can't find field '{key}'!")
            exit()
    return output

if __name__ == "__main__":
    field_list = ["field_1", "field_2"]
    good_result = extract_data_from_pdf("test_pdf - initial.pdf", field_list)  # works fine
    print(good_result)
    # bad_result = extract_data_from_pdf("test_pdf - processed.pdf", field_list)  # does not work :(
    # print(bad_result)
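A note on the hard-coded index 13: as far as I understand the PDF format, when /XFA is an array it alternates packet names and stream references (name, stream, name, stream, ...), so a packet can be looked up by name instead of by position. Below is a minimal sketch under that assumption. `find_xfa_packet` is a hypothetical helper, not part of PyPDF2, and it does not handle the case where /XFA is a single stream rather than an array. It also assumes the field values live in the "datasets" packet, which is the part the C# code edits through `xfa.DatasetsNode`.

```python
def find_xfa_packet(xfa_array, packet_name="datasets"):
    """Return the entry that follows `packet_name` in an /XFA array.

    Assumes the array alternates names and streams:
    [name0, stream0, name1, stream1, ...]. Returns None if the
    name is not present.
    """
    names = xfa_array[0::2]    # entries at even positions: packet names
    streams = xfa_array[1::2]  # entries at odd positions: the streams
    for name, stream in zip(names, streams):
        if str(name) == packet_name:
            return stream
    return None
```

In `extract_data_from_pdf`, this would stand in for `xfa[13]`, something like `find_xfa_packet(xfa).getObject().getData()`. It does not help when `xfa` itself is None, which is the actual error described above.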
Here is the C# code that I use to fill in the pdf, which then seems to break it for PyPDF:
using System;
using System.IO;
using iTextSharp.text.pdf;
using System.Xml;

namespace xfa_editor
{
    class Program
    {
        static void Main()
        {
            string[] args = Environment.GetCommandLineArgs();
            string[] fields = args[2].Replace("\'", "").Split(',');
            string[] values = args[3].Replace("\'", "").Split(',');
            Console.WriteLine("fields[0] is " + fields[0]);
            byte[] result = EditFieldsXFAImproved(args[1], fields, values);
            File.WriteAllBytes(args[1], result);
            Console.WriteLine("In " + args[1] + " The value of the field '" + args[2] + "' has been set to " + args[3]);
        }

        public static byte[] EditFieldsXFAImproved(string path, string[] xpath, string[] values)
        {
            PdfReader reader = new PdfReader(path);
            using (MemoryStream ms = new MemoryStream())
            {
                PdfStamper stamper = new PdfStamper(reader, ms);
                AcroFields form = stamper.AcroFields;
                XfaForm xfa = form.Xfa;
                XmlNode a = xfa.DatasetsNode;
                for (int i = 0; i < xpath.Length; i++)
                {
                    XmlNodeList hits = a.SelectNodes("//" + xpath[i]);
                    foreach (XmlNode hit in hits)
                    {
                        if (hit.NodeType == XmlNodeType.Element)
                        {
                            hit.InnerText = values[i];
                        }
                    }
                }
                xfa.Changed = true;
                stamper.Close();
                reader.Close();
                return ms.ToArray();
            }
        }
    }
}
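For comparison, the XML part of what the C# loop does (find every element with a given name and overwrite its text) can be sketched in Python with ElementTree. This is only a sketch of the XML manipulation: `set_field_values` is a hypothetical helper, and writing the modified XML back into the PDF is not covered here. One caveat: ElementTree may rewrite namespace prefixes when serializing, so the output is not guaranteed to be byte-for-byte what a PDF tool expects.

```python
import xml.etree.ElementTree as ET


def set_field_values(xml_bytes, new_values):
    """Set the text of every element whose tag is a key in `new_values`.

    xml_bytes:  the XFA data XML as bytes
    new_values: dict mapping element tag -> new text
    Returns the modified XML serialized back to bytes.
    """
    root = ET.fromstring(xml_bytes)
    for tag, value in new_values.items():
        # list(...) so that removing children does not disturb the iteration
        for hit in list(root.iter(tag)):
            # Like InnerText in C#: drop child elements, keep only the text
            for child in list(hit):
                hit.remove(child)
            hit.text = value
    return ET.tostring(root, encoding="utf-8")
```

Usage would look like `set_field_values(xml, {"field_1": "new value"})`, where `xml` is the bytes returned by `getData()` in the Python code above.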
Upvotes: 0
I found the answer here.
Turns out I could fix the C# code by opening the PdfStamper in append mode (the fourth constructor argument), that is, by changing
PdfStamper stamper = new PdfStamper(reader, ms);
to
PdfStamper stamper = new PdfStamper(reader, ms, '\0', true);
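As I understand it, append mode makes the stamper write an incremental update: the original bytes are kept and the changes are added after them, instead of the whole file being rewritten. A rough way to check this from Python is to compare the processed file against a copy of the original taken before the C# tool overwrote it. This is a sketch under that assumption, and `looks_like_incremental_update` is a hypothetical helper name.

```python
def looks_like_incremental_update(original: bytes, processed: bytes) -> bool:
    """Heuristic: an incrementally updated PDF should start with the
    untouched original bytes and be longer than the original."""
    return len(processed) > len(original) and processed.startswith(original)
```

For example, read both files with `open(path, "rb").read()` and pass the two byte strings in. A result of False suggests the file was rewritten rather than appended to.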
Upvotes: 1