Djoni Djonovic
Djoni Djonovic

Reputation: 11

Extracting embedded XML File from PDF ZUGFERD

I need to get embedded xml from my pdf file, I read that the embedded file should be in /EmbeddedFiles if i try to get these dates through this site https://eforms.aloaha.com/zugferd.aspx Then I got an xml file, but if I try to get this data through my function, the program write -> The specified PDF file does not contain an embedded ZUGFeRD XML file.

this is what i tried

 def extract_zugferd_xml(pdf_file_path):
        """
        Extracts the ZUGFeRD XML from a PDF file and returns it as a string.
    
        Args:
            pdf_file_path (str): The path to the PDF file to extract the XML from.
    
        Returns:
            str: The ZUGFeRD XML as a string.
        """
        # Open the PDF file in binary mode using PyPDF2
        with open(pdf_file_path, 'rb') as pdf_file:
            # Create a PDF reader object
            pdf_reader = PyPDF2.PdfReader(pdf_file)
    
            # Get the number of pages in the PDF file
            # num_pages = pdf_reader.getNumPages()
            num_pages = len(pdf_reader.pages)
    
            # Loop through each page in the PDF file
            for page_num in range(num_pages):
                # Get the page object for the current page
                # page = pdf_reader.getPage(page_num)
                page =  pdf_reader.pages[page_num]
    
                print("PAGE ", page)
                print("MediaBox ", page['/MediaBox'])
                print("Parent ", page['/Parent']['/Kids'][0])
                print("Parent type", type(page['/Parent']['/Kids']))
    
                # Create a PDF reader object
                pdf_reader = PyPDF2.PdfReader(pdf_file)
    
                # Get the indirect reference object
                indirect_ref = page['/Parent']['/Kids'][0]
    
                # Get the referenced object directly
                ref_obj = pdf_reader.get_object(indirect_ref)
    
                # Print the content of the object
                print("ref_obj ", ref_obj)
    
                # Check if the page has any embedded files
                if '/EmbeddedFiles' in page:
                    # Get the embedded files dictionary
                    embedded_files = page['/EmbeddedFiles']
    
                    # Loop through each entry in the embedded files dictionary
                    for file_name, file_obj in embedded_files.items():
                        # Check if the file is a ZUGFeRD XML file
                        if file_name.endswith('.xml') and 'ZUGFeRD' in file_obj.getData():
                            # Extract and return the XML data as a string
                            return file_obj.getData().decode('utf-8')
    
        # If the function reaches this point, it means the PDF file does not contain an embedded ZUGFeRD XML file
        raise ValueError('The specified PDF file does not contain an embedded ZUGFeRD XML file.')

print is ->

PAGE  {'/Type': '/Page', '/MediaBox': [0, 0, 595, 842], '/Rotate': 0, '/Parent': IndirectObject(3, 0, 4370055712), '/Resources': {'/ProcSet': ['/PDF', '/Text'], '/Font': IndirectObject(21, 0, 4370055712)}, '/Contents': IndirectObject(11, 0, 4370055712)}
MediaBox  [0, 0, 595, 842]
Parent  IndirectObject(10, 0, 4370055712)
Parent type <class 'PyPDF2.generic._data_structures.ArrayObject'>

Why? or where is /EmbeddedFiles in PDF

Upvotes: 0

Views: 176

Answers (2)

Djoni Djonovic
Djoni Djonovic

Reputation: 11

I found a solution it turns out that embedded pdf files are hidden in specific attributes of pdf files in my mz files you can find them at the path /Root/Names/EmbeddedFiles/Names or /Root/Names/EmbeddedFiles/Kids here is the full code

def get_attachments(reader):
    """
          Retrieves the file attachments of the PDF as a dictionary of file names
          and the file data as a bytestring.
          :return: dictionary of filenames and bytestrings
          """
    try:
        attachments = {}
        catalog = reader.trailer["/Root"]
        try:
            file_names = catalog['/Names']['/EmbeddedFiles']['/Names'] # korrect
        except (Exception,):
            file_names = catalog['/Names']['/EmbeddedFiles']['/Kids']
        # xml nam amd xml file
        if len(file_names) == 2:
            for f in file_names:
                if isinstance(f, str):
                    name = f
                    data_index = file_names.index(f) + 1
                    f_dict = file_names[data_index].get_object()
                    f_data = f_dict['/EF']['/F'].get_data()
                    attachments[name] = f_data

            return attachments
        # only xml file
        elif len(file_names) == 1:
            for f in file_names:
                data_index = file_names.index(f)
                f_dict = file_names[data_index].get_object()
                name = f_dict['/Names'][0]
                f_dict = f_dict['/Names'][1].get_object()    

                f_data = f_dict['/EF']['/F'].get_data()

                attachments[name] = f_data
            return attachments


    except Exception as e:
        print(f"Exception {e}")
        return {}


pdf_file = open('Testinvoice.pdf', 'rb')


# Create a PDF reader object
pdf_reader = PyPDF2.PdfReader(pdf_file)
xml_data = get_attachments(pdf_reader)

Upvotes: 0

K J
K J

Reputation: 11739

An attached XML File in a PDF invoice etc. is intended to be extracted by a user for verification. Such that there are matching visible entries in the PDF. in effect much like checking a visible signature is valid.

enter image description here

Thus the XML file is not related to any one PDF page but the overall containment.
IMHO easier if it was a zip! see note below*

To programmatically detach the XML from the PDF the easiest is use the OS shell to call all BULK PDF detachments.

Python usually includes a copy of PDF tools such as poppler PDFdetach.

Thus run OS command with paths as required

pdfdetach -save 1 invoice.pdf

dir fact*.xml

2024-12-06  15:50             9,069 factur-x.xml
  • Note all the times and dates are meaningless so the extracted XML is dated NOW The PDF file says was inserted 2013 but the content says it was written 25.6.2014. So which is correct ? And thus what is the point of embedment's ! the two separate files could have been sent easier at same time in a zip container.

Upvotes: 0

Related Questions