CCSoftBarth

Reputation: 1

Extract XML from ZUGFeRD PDF with Ghostscript

We would like to automate the processing of ZUGFeRD invoices. Is there a way to extract and save the XML files embedded in the PDF using Ghostscript?

Upvotes: 0

Views: 1647

Answers (2)

K J

Reputation: 11739

As mentioned by KenS, Ghostscript can help assemble ZUGFeRD files but cannot extract their contents. The image below shows those contents in the source XML (lower part) and in a well-behaved PDF where the XML is visible as plain text (upper part, the PDF opened in WordPad) and can easily be extracted as text. However, nothing about PDF extraction is reliable, since the internal layout of one PDF is rarely the same as the next unless you make it so.

Many PDF readers have the ability to export such attachments as the source file and many PDF libraries will allow for extraction of the named file in a scripted fashion.

[Image: a ZUGFeRD PDF opened in WordPad (upper part) and the source XML (lower part)]

The samples above are from the open-source Java application at https://www.mustangproject.org/, which is currently very up to date.

For very simple cross-platform use there is pdfdetach, which can save a single attachment by name or all attachments at once.
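For automated processing, pdfdetach can be driven from a script. The following is only a minimal sketch: it assumes Poppler's pdfdetach is installed and on the PATH, and that `-saveall` together with `-o DIRECTORY` behaves as described in Poppler's man page (check `pdfdetach -h` for your version). The function names `pdfdetach_saveall_argv` and `extract_attachments` are made up for this illustration.

```python
import subprocess
from pathlib import Path


def pdfdetach_saveall_argv(pdf_path: str, out_dir: str) -> list[str]:
    """Build the command line that saves every attachment of pdf_path.

    Assumption: Poppler's pdfdetach, where -saveall saves all embedded
    files and -o names the target directory in that mode.
    """
    return ["pdfdetach", "-saveall", "-o", out_dir, pdf_path]


def extract_attachments(pdf_path: str, out_dir: str) -> list[Path]:
    """Run pdfdetach and return whatever files ended up in out_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if pdfdetach reports a failure.
    subprocess.run(pdfdetach_saveall_argv(pdf_path, out_dir), check=True)
    return sorted(Path(out_dir).iterdir())


if __name__ == "__main__":
    # Hypothetical file and directory names.
    for saved in extract_attachments("invoice.pdf", "attachments"):
        print(saved)
```

For a ZUGFeRD or Factur-X invoice the directory should then contain the invoice XML, for example a file named factur-x.xml.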


Answer

With any programming tool it is possible to export the XML text, but only AFTER the file has been decrypted (if it is password protected) and the relevant file streams have been decompressed.

Ghostscript can accept a password when one is required (which is not usual for this type of file). However, for a plain text search to work the XML must be stored as plain text, so try searching for "XML" in the PDF with Windows FindStr, or with similar filter commands on Mac or Linux. If the stream has not been inflated (decompressed) or is otherwise not plain text, further decoding may be required before extraction.

Any text extraction method can then locate and export the content between the stream keyword that precedes the XML and the endstream keyword that follows it.

To check whether a file carries the fingerprint of such an attachment, we can search for "xml":
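The stream-to-endstream idea can be sketched in a few lines of Python. This is a heuristic, not a PDF parser: it assumes the file is not encrypted, that the attachment is either stored as plain text or compressed with Flate only, and that the bytes "endstream" do not occur inside the stream data. The name `find_xml_streams` is hypothetical.

```python
import re
import zlib

# Everything between a "stream" keyword (followed by an end-of-line)
# and the next "endstream" keyword.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def find_xml_streams(pdf_bytes: bytes) -> list[bytes]:
    """Return the content of every stream that looks like an XML document."""
    found = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj() tolerates the end-of-line bytes that sit
            # between the compressed data and "endstream".
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Not Flate data: treat the stream as plain text.
            data = raw
        # Skip an optional UTF-8 byte order mark and leading whitespace.
        if data.lstrip(b"\xef\xbb\xbf \t\r\n").startswith(b"<?xml"):
            found.append(data.strip())
    return found


if __name__ == "__main__":
    # Hypothetical input file name.
    with open("Basic_Einfach.pdf", "rb") as handle:
        for number, xml in enumerate(find_xml_streams(handle.read()), 1):
            with open(f"attachment-{number}.xml", "wb") as out:
                out.write(xml)
```

A dedicated tool or PDF library remains the more robust choice, since it reads the stream length and filter chain from the object dictionary instead of guessing.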

>type Basic_Einfach.pdf | Findstr /i "xml"
/Names [ (factur\055x\056xml) 23 0 R ]
/Subtype /XML
/Subtype /text#2Fxml
/F (factur\055x\056xml)
/UF (factur\055x\056xml)

>

That shows that some form of decoding and decompression with another PDF tool is needed, as we did NOT see either of these types of entries:

<?xml version='1.0' encoding='UTF-8' ?>
<rsm:CrossIndustryInvoice xmlns:rsm="urn:un:unece:uncefact:data:standard:CrossIndustryInvoice:100" xmlns:qdt="urn:un:unece:uncefact:data:standard:QualifiedDataType:100" xmlns:ram="urn:un:unece:uncefact:data:standard:ReusableAggregateBusinessInformationEntity:100" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:udt="urn:un:unece:uncefact:data:standard:UnqualifiedDataType:100">
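The FindStr check can be reproduced on any platform with a short script. The sketch below is an assumption-laden equivalent rather than a replacement for a real PDF tool: `lines_mentioning` and `has_plain_xml` are hypothetical helper names, and a hit only indicates that an attachment is declared, not that it is readable as text. In the FindStr output, `\055` and `\056` are PDF octal escapes for "-" and ".", so the declared attachment is factur-x.xml.

```python
def lines_mentioning(pdf_bytes: bytes, needle: bytes = b"xml") -> list[str]:
    """Case-insensitive line filter, similar to `findstr /i "xml"`."""
    wanted = needle.lower()
    hits = []
    for line in pdf_bytes.splitlines():
        if wanted in line.lower():
            # latin-1 maps every byte to a character, so decoding cannot fail.
            hits.append(line.decode("latin-1").strip())
    return hits


def has_plain_xml(pdf_bytes: bytes) -> bool:
    """True if an XML declaration is visible without any decompression."""
    return b"<?xml" in pdf_bytes


if __name__ == "__main__":
    # Hypothetical input file name.
    with open("Basic_Einfach.pdf", "rb") as handle:
        content = handle.read()
    for hit in lines_mentioning(content):
        print(hit)
    print("plain XML visible:", has_plain_xml(content))
```

If `has_plain_xml` reports False while the filter still lists `/Subtype /text#2Fxml` entries, the attachment is present but compressed.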

Upvotes: 0

Guido Flohr

Reputation: 2340

Install pdftk and then run:

$ pdftk file.pdf unpack_files output attachments-dir

Without the arguments output DIRECTORY, the attachments are saved in the current working directory.

Upvotes: 1
