Reputation: 1
We would like to automate the processing of ZUGFeRD invoices. Is there a way to extract and save the XML files embedded in the PDF using Ghostscript?
Upvotes: 0
Views: 1647
Reputation: 11739
As mentioned by KenS, Ghostscript can help assemble ZUGFeRD files but cannot extract their contents. Below we can see those contents in the source XML (lower part of the image) and in a seemingly good PDF where the plain text is visible (upper part of the image, the PDF viewed in WordPad) and can easily be extracted as text. However, nothing about PDF extraction is reliable, since the structure of one PDF is rarely the same as the next unless you make it so.
Many PDF readers have the ability to export such attachments as the source file and many PDF libraries will allow for extraction of the named file in a scripted fashion.
The samples above are from the open source Java application at https://www.mustangproject.org/, which is currently very up to date.
For very simple cross-platform use there is pdfdetach, which can save a single attachment by name or all attachments at once.
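For scripted use, a call to pdfdetach could be wrapped roughly as below. This is only a sketch, assuming the Poppler build of pdfdetach is on the PATH and that its `-saveall`, `-savefile` and `-o` options are available (other builds may differ); the function names `pdfdetach_command` and `extract_attachments` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def pdfdetach_command(pdf_path, out_dir, name=None):
    """Build the pdfdetach argument list (Poppler option names assumed)."""
    cmd = ["pdfdetach"]
    if name is None:
        # Save every attachment into out_dir.
        cmd += ["-saveall", "-o", str(out_dir)]
    else:
        # Save only the attachment with this name, e.g. "factur-x.xml".
        cmd += ["-savefile", name, "-o", str(Path(out_dir) / name)]
    cmd.append(str(pdf_path))
    return cmd


def extract_attachments(pdf_path, out_dir, name=None):
    """Run pdfdetach and return the files found in out_dir afterwards."""
    if shutil.which("pdfdetach") is None:
        raise RuntimeError("pdfdetach was not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfdetach_command(pdf_path, out_dir, name), check=True)
    return sorted(Path(out_dir).iterdir())
```

Something like `extract_attachments("Basic_Einfach.pdf", "attachments", "factur-x.xml")` would then be the per-invoice step in a batch job.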
With any programming tool it is possible to export the XML text, but only AFTER the file has been decrypted (if it is password protected) and any compressed file streams have been decompressed.
Ghostscript can accept a password when one is required (not usually the case for this type of file). However, the XML must be searchable as plain text, so try searching for XML in the PDF via Windows FindSTR and similar filter commands on Mac or Linux. If the stream is still Flate-compressed or otherwise not plain text, further decoding may be required before extraction.
Once it is plain text, any text extraction method can see and export everything from the stream keyword that starts the XML to the endstream keyword at the end of the XML.
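The stream-to-endstream idea can be sketched in a few lines of Python. This is a rough illustration, not a PDF parser: it assumes an unencrypted file, objects that are not packed into object streams, and at most a single /FlateDecode filter on the embedded file. The function name `embedded_file_streams` is made up for this example.

```python
import re
import zlib

# Very loose object matcher: "<num> <gen> obj ... endobj".
OBJ_RE = re.compile(rb"\d+\s+\d+\s+obj(.*?)endobj", re.DOTALL)
STREAM_START_RE = re.compile(rb"stream\r?\n")


def embedded_file_streams(pdf_bytes):
    """Yield decoded payloads of objects whose dictionary mentions /EmbeddedFile."""
    for match in OBJ_RE.finditer(pdf_bytes):
        body = match.group(1)
        start = STREAM_START_RE.search(body)
        end = body.rfind(b"endstream")
        if start is None or end == -1:
            continue  # object has no stream
        header = body[:start.start()]
        if b"/EmbeddedFile" not in header:
            continue  # some other stream (page content, font, ...)
        raw = body[start.end():end]
        if b"/FlateDecode" in header:
            # decompressobj tolerates the end-of-line bytes before endstream.
            yield zlib.decompressobj().decompress(raw)
        else:
            yield raw.rstrip(b"\r\n")
```

Calling `list(embedded_file_streams(open("Basic_Einfach.pdf", "rb").read()))` would then return the attachment bytes, if the assumptions above hold for that file. For anything beyond a quick check, a real PDF library is the safer route.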
To show whether a file has the fingerprint of such an attachment, we could search for "XML":
>type Basic_Einfach.pdf | Findstr /i "xml"
/Names [ (factur\055x\056xml) 23 0 R ]
/Subtype /XML
/Subtype /text#2Fxml
/F (factur\055x\056xml)
/UF (factur\055x\056xml)
>
That shows that some form of decoding and decompression with another PDF tool is needed, as we did NOT see either of these types of entries:
<?xml version='1.0' encoding='UTF-8' ?>
<rsm:CrossIndustryInvoice xmlns:rsm="urn:un:unece:uncefact:data:standard:CrossIndustryInvoice:100" xmlns:qdt="urn:un:unece:uncefact:data:standard:QualifiedDataType:100" xmlns:ram="urn:un:unece:uncefact:data:standard:ReusableAggregateBusinessInformationEntity:100" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:udt="urn:un:unece:uncefact:data:standard:UnqualifiedDataType:100">
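The same fingerprint check can be done cross-platform in Python. This is a minimal sketch of the Findstr idea, with hypothetical function names: `xml_fingerprints` lists the lines that mention "xml", and `has_plain_xml` reports whether the invoice payload itself is readable without decompression.

```python
def xml_fingerprints(pdf_bytes):
    """Return the lines of a PDF that mention 'xml', ignoring case."""
    hits = []
    for line in pdf_bytes.splitlines():
        if b"xml" in line.lower():
            # latin-1 maps every byte to a character, so binary lines are safe.
            hits.append(line.strip().decode("latin-1"))
    return hits


def has_plain_xml(pdf_bytes):
    """True if an XML declaration or the invoice root element is visible as text."""
    return b"<?xml" in pdf_bytes or b"CrossIndustryInvoice" in pdf_bytes
```

If `xml_fingerprints` finds entries such as /Subtype /text#2Fxml but `has_plain_xml` returns False, the attachment is present but still encoded, which matches the situation shown above.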
Upvotes: 0
Reputation: 2340
Install pdftk and then:
$ pdftk file.pdf unpack_files output attachments-dir
Without the arguments output DIRECTORY, the attachments are saved in the current working directory.
Upvotes: 1