Reputation: 263
I would like to use Python to retrieve metadata stored in PDF files. I am trying to use Python xmptools, but find that I cannot extract all the metadata. For example, this paper is available in PDF format. I have the following script that tries to extract the metadata:
from xmptools import XMPMetadata, DC
xmp = XMPMetadata.fromFile("Leonard_2015_Comment_on_‘Dimensionless_units_in_the_SI’.pdf")[0]
print( xmp.getContainerItems(DC.publisher) )
This works fine. The result is [rdflib.term.Literal('IOP Publishing')]. However, if I change the last line to
print( xmp.getContainerItems(DC.identifier) )
then I get None as a result.
I think this may be due to the structure of the XML inside the PDF file. The data relevant to these two queries is:
<dc:publisher>
<rdf:Bag>
<rdf:li>IOP Publishing</rdf:li>
</rdf:Bag>
</dc:publisher>
<dc:identifier>doi:10.1088/0026-1394/52/4/613</dc:identifier>
In the case of publisher, the information is wrapped in RDF tags, but that is not the case for identifier.
Is there a way for xmptools to read simple entries where RDF tags have not been used?
Upvotes: 0
Views: 782
Reputation: 12662
pypdf can read PDF metadata. Common properties are exposed as attributes out of the box; alternatively, you can obtain the root minidom object and iterate over it.
from pypdf import PdfReader
fd = open("/home/lmc/tmp/shapes.pdf", "rb")
reader = PdfReader(fd)
meta = reader.xmp_metadata
meta.dc_identifier
Result:
'doi:1.1.1.1.1.'
Getting the root minidom object
meta = reader.xmp_metadata
root = meta.rdf_root
print(type(root))
print(root.toxml())
Result
<class 'xml.dom.minidom.Element'>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/" rdf:about="">
<pdfaid:part>3</pdfaid:part>
<pdfaid:conformance>B</pdfaid:conformance>
</rdf:Description>
<rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
<!-- redacted -->
<xmp:MetadataDate>2024-05-06T19:20:03-03:00</xmp:MetadataDate>
</rdf:Description>
</rdf:RDF>
Getting specific elements
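# look up by qualified name (depends on the prefix used in the file)
for node in root.getElementsByTagName('xmp:ModifyDate'):
    print(node.firstChild.nodeValue, node.toxml())
# look up by namespace URI and local name (independent of the prefix)
for node in root.getElementsByTagNameNS('http://ns.adobe.com/xap/1.0/', 'ModifyDate'):
    print(node.firstChild.nodeValue, node.toxml())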
Result
2024-05-06T19:20:03-03:00 <xmp:ModifyDate>2024-05-06T19:20:03-03:00</xmp:ModifyDate>
2024-05-06T19:20:03-03:00 <xmp:ModifyDate>2024-05-06T19:20:03-03:00</xmp:ModifyDate>
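The same minidom calls can cover the original question, where dc:publisher is wrapped in an rdf:Bag but dc:identifier is plain text. Below is a minimal sketch, not part of pypdf or xmptools: the helper name property_values and the SAMPLE string are made up for illustration, and the sample is rebuilt from the two properties quoted in the question. With pypdf you would presumably pass reader.xmp_metadata.rdf_root as root instead.

```python
from xml.dom import minidom

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical sample, rebuilt from the two properties quoted in the question.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
 <rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
  <dc:publisher>
   <rdf:Bag>
    <rdf:li>IOP Publishing</rdf:li>
   </rdf:Bag>
  </dc:publisher>
  <dc:identifier>doi:10.1088/0026-1394/52/4/613</dc:identifier>
 </rdf:Description>
</rdf:RDF>"""


def _text(node):
    """Concatenate the direct text children of a node and trim whitespace."""
    return "".join(
        child.data
        for child in node.childNodes
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
    ).strip()


def property_values(root, namespace, local_name):
    """Return every value of a property as a list of strings.

    Handles both container values (rdf:Bag / rdf:Seq / rdf:Alt with rdf:li
    items) and simple values stored directly as element text.
    """
    values = []
    for prop in root.getElementsByTagNameNS(namespace, local_name):
        items = prop.getElementsByTagNameNS(RDF_NS, "li")
        if items:
            # Container form: one value per rdf:li
            values.extend(_text(li) for li in items)
        else:
            # Simple form: the value is the element's own text
            text = _text(prop)
            if text:
                values.append(text)
    return values


root = minidom.parseString(SAMPLE).documentElement
print(property_values(root, DC_NS, "publisher"))   # ['IOP Publishing']
print(property_values(root, DC_NS, "identifier"))  # ['doi:10.1088/0026-1394/52/4/613']
```

Always returning a list keeps the caller from having to know in advance which form a given file uses.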
Additionally, pyxml2xpath can list all the XPath expressions found in the metadata (XML), which shows which elements are present without parsing element by element.
# pip install pyxml2xpath==0.3.3
from xml2xpath import xml2xpath
tree, ns, xmap = xml2xpath.fromstring(root.toxml())
# get specific element
mod_date = tree.xpath('//rdf:Description/xmp:ModifyDate', namespaces=ns)[0]
print('ModifyDate', mod_date.text)
# print all found elements
xml2xpath.print_xpaths(xmap, 'all')
Result (redacted)
ModifyDate 2024-05-06T19:20:03-03:00
/rdf:RDF
/rdf:RDF/rdf:Description[1]
/rdf:RDF/rdf:Description[1]/@{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about
/rdf:RDF/rdf:Description[1]/pdfaid:part
/rdf:RDF/rdf:Description[1]/pdfaid:conformance
/rdf:RDF/rdf:Description[2]
/rdf:RDF/rdf:Description[2]/@{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about
/rdf:RDF/rdf:Description[2]/dc:format
/rdf:RDF/rdf:Description[2]/dc:title
/rdf:RDF/rdf:Description[2]/dc:rights
/rdf:RDF/rdf:Description[2]/dc:rights/rdf:Alt
/rdf:RDF/rdf:Description[2]/dc:rights/rdf:Alt/rdf:li
/rdf:RDF/rdf:Description[2]/dc:rights/rdf:Alt/rdf:li/@{http://www.w3.org/XML/1998/namespace}lang
/rdf:RDF/rdf:Description[2]/dc:type
/rdf:RDF/rdf:Description[3]
/rdf:RDF/rdf:Description[3]/@{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about
/rdf:RDF/rdf:Description[3]/pdf:Producer
/rdf:RDF/rdf:Description[3]/pdf:Keywords
/rdf:RDF/rdf:Description[3]/pdf:PDFVersion
/rdf:RDF/rdf:Description[4]
/rdf:RDF/rdf:Description[4]/@{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about
/rdf:RDF/rdf:Description[4]/xmp:CreatorTool
/rdf:RDF/rdf:Description[4]/xmp:CreateDate
/rdf:RDF/rdf:Description[4]/xmp:ModifyDate
/rdf:RDF/rdf:Description[4]/xmp:MetadataDate
Found 38 xpath expressions for elements
Found 7 xpath expressions for attributes
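If installing another package is not an option, a rough equivalent of that listing can be sketched with the standard library. This is not how pyxml2xpath works internally; it is a simplified walk that prints one path per element, without positional indexes or attributes. The PREFIXES table and the SAMPLE string are assumptions for illustration; in practice you would pass root.toxml().

```python
import xml.etree.ElementTree as ET

# Assumed prefix table; extend it with the namespaces your files use.
PREFIXES = {
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#": "rdf",
    "http://ns.adobe.com/xap/1.0/": "xmp",
}

# Hypothetical sample standing in for root.toxml().
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
 <rdf:Description xmlns:xmp="http://ns.adobe.com/xap/1.0/" rdf:about="">
  <xmp:ModifyDate>2024-05-06T19:20:03-03:00</xmp:ModifyDate>
  <xmp:MetadataDate>2024-05-06T19:20:03-03:00</xmp:MetadataDate>
 </rdf:Description>
</rdf:RDF>"""


def qname(tag):
    """Turn ElementTree's '{uri}local' tag into 'prefix:local'."""
    if not tag.startswith("{"):
        return tag
    uri, _, local = tag[1:].partition("}")
    # Fall back to the full URI when the namespace is not in PREFIXES
    return f"{PREFIXES.get(uri, uri)}:{local}"


def element_paths(xml_text):
    """Return one slash-separated path per element, in document order."""
    root = ET.fromstring(xml_text)
    paths = []

    def walk(elem, prefix):
        path = f"{prefix}/{qname(elem.tag)}"
        paths.append(path)
        for child in elem:
            walk(child, path)

    walk(root, "")
    return paths


for path in element_paths(SAMPLE):
    print(path)
```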
Upvotes: 3
Reputation: 11744
I am simply going to provide a method to extract the full XMP packet from a PDF. Personally, I find XML, with its ad hoc nesting, less than useful, but with Python or any XMP editor you can convert the XMP XML into more useful output.
One cross-platform PDF tool that can extract PDF objects is MuTool, the command-line tool from MuPDF (the basis of PyMuPDF).
I use Windows, so here is a cmd file to pull or custom-manipulate the data:
getxmp.cmd
@echo off
set "mutool=C:\Users\lez\Downloads\Apps\PDF\mupdf\1.20.0\Mutool.exe"
"%mutool%" show "%~1" | find "/Root" >"%temp%\temp$.tmp"
set /P Object$=<"%temp%\temp$.tmp"
for /F "tokens=2 delims= " %%R in ("%Object$%") do ("%mutool%" show "%~1" %%R | find "/Metadata") >"%temp%\temp$.tmp"
if errorlevel==1 echo /Metadata not found&& type "%temp%\temp$.tmp"&& pause&& del "%temp%\temp$.tmp"&& exit /b
set /P Object$=<"%temp%\temp$.tmp"
for /F "tokens=2 delims=:" %%C in ('chcp') do set /a oldcp =%%C&&chcp 65001 >nul
for /F "tokens=2 delims= " %%M in ("%Object$%") do "%mutool%" show -b -o "%~1.xmp" "%~1" %%M
del "%temp%\temp$.tmp"
REM post processing as desired here exclusions
if not [%1]==[] type "%~1.xmp" ^
| find /i /v "xmpmeta" ^
| find /i /v "xmlns" ^
| find /i /v "bag" ^
| find /i /v "seq" ^
| find /i /v "xpacket" ^
| find /i /v "rdf:RDF" ^
| find ":"
if not [%2]==[] type "%~1.xmp" | find /i /v "xmpmeta" | find /i /v "xmlns" | find /i /v "bag" | find /i /v "seq" | find /i "%2"
chcp %oldcp% >nul
pause
That lets you drag and drop a PDF to get the data in a name.pdf.xmp sidecar file; from the console you can add other switches and edit the result as desired. However, every PDF can have a different XMP structure, or no /Metadata entry at all.
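Since the question asks about Python, the same "pull out the whole packet" idea can be sketched without MuTool. This is only a rough stand-in, not what the cmd file does: it scans the raw bytes for the xpacket markers, so it assumes the metadata stream is stored uncompressed and the file is not encrypted, and read_xmp further assumes the packet is UTF-8. The function names and the FAKE_PDF bytes are made up for illustration.

```python
import re

# An XMP packet is delimited by <?xpacket begin=...?> and <?xpacket end=...?>
XPACKET = re.compile(rb"<\?xpacket begin=.*?<\?xpacket end=.*?\?>", re.DOTALL)


def find_xmp_packets(pdf_bytes):
    """Return every raw XMP packet found in the given bytes."""
    return [match.group(0) for match in XPACKET.finditer(pdf_bytes)]


def read_xmp(path):
    """Return the first XMP packet in a file as text, or None if absent.

    Assumes the packet is UTF-8; undecodable bytes are replaced.
    """
    with open(path, "rb") as handle:
        packets = find_xmp_packets(handle.read())
    return packets[0].decode("utf-8", errors="replace") if packets else None


# Hypothetical bytes standing in for a real PDF with an uncompressed packet.
FAKE_PDF = (
    b"%PDF-1.4\n1 0 obj\n<< /Type /Metadata /Subtype /XML >>\nstream\n"
    b'<?xpacket begin="\xef\xbb\xbf" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/"></x:xmpmeta>'
    b'<?xpacket end="w"?>\nendstream\nendobj\n'
)

packets = find_xmp_packets(FAKE_PDF)
print(len(packets), "packet(s) found")
```

If the stream is compressed or the file has no /Metadata entry, the list is simply empty, and a real PDF library is the better route.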
Another useful cross-platform XMP metadata tool is ExifTool.
ExifTool will extract XMP information even if it is not listed in
This is highly configurable for drag and drop, so if the executable is named exiftool(-a -U -g1 -w xmp).exe, the result will be structured like this:
comment.xmp (perhaps I should have not used the xmp ext !)
---- ExifTool ----
ExifTool Version Number : 12.84
---- System ----
File Name : Comment.pdf
Directory : C:/Users/lez/Downloads/Apps/PDF/mupdf/1.20.0
File Size : 295 kB
Zone Identifier : Exists
File Modification Date/Time : 2024:05:07 03:16:36+01:00
File Access Date/Time : 2024:05:07 15:39:46+01:00
File Creation Date/Time : 2024:05:07 03:16:33+01:00
File Permissions : -rw-rw-rw-
---- File ----
File Type : PDF
File Type Extension : pdf
MIME Type : application/pdf
---- PDF ----
PDF Version : 1.4
Linearized : No
Author : B P Leonard
Create Date : 2015:07:30 20:24:16+05:30
Creator : Adobe InDesign CS5.5 (7.5)
Cross Mark Domains 1 : iop.org
Cross Mark Major Version Date : 2015-8-3
Crossmark Domain Exclusive : true
Modify Date : 2015:08:04 11:18:41+01:00
Producer : Adobe PDF Library 9.9
Subject : Metrologia, 52 (2015) 613. doi: 10.1088/0026-1394/52/4/613
Title : Comment on ‘Dimensionless units in the SI’
Trapped : False
Doi : 10.1088/0026-1394/52/4/613
Robots : noindex
Rgid : PB:280873495_AS:314881943769089@1452085113684
Page Count : 6
---- XMP-x ----
XMP Toolkit : Adobe XMP Core 5.4-c005 78.147326, 2012/08/23-13:03:03
---- XMP-pdfx ----
Doi : 10.1088/0026-1394/52/4/613
Robots : noindex
Cross Mark Major Version Date : 2015-8-3
Crossmark Domain Exclusive : true
Cross Mark Domains : iop.org
---- XMP-xmp ----
Creator Tool : Adobe InDesign CS5.5 (7.5)
Create Date : 2015:07:30 20:24:16+05:30
Modify Date : 2015:08:04 11:18:41+01:00
Metadata Date : 2015:08:04 11:18:41+01:00
---- XMP-xmpRights ----
Marked : True
---- XMP-dc ----
Format : application/pdf
Identifier : doi:10.1088/0026-1394/52/4/613
Title : Comment on ‘Dimensionless units in the SI’
Creator : B P Leonard
Publisher : IOP Publishing
Description : Metrologia, 52 (2015) 613. doi: 10.1088/0026-1394/52/4/613
---- XMP-prism ----
Aggregation Type : journal
Publication Name : Metrologia
Copyright : © 2015 BIPM & IOP Publishing Ltd
ISSN : 0026-1394
Starting Page : 613
Ending Page : 616
Page Range : 613
Digital Object Identifier : 10.1088/0026-1394/52/4/613
URL : http://dx.doi.org/10.1088/0026-1394/52/4/613
---- XMP-crossmark_1_ ----
Major Version Date : 2015-8-3
Crossmark Domain Exclusive : true
Doi : 10.1088/0026-1394/52/4/613
Cross Mark Domains : iop.org
---- XMP-pdf ----
Producer : Adobe PDF Library 9.9
Trapped : False
---- XMP-xmpMM ----
Document ID : uuid:411519c7-630a-4745-9153-f20c68b14cfe
Instance ID : uuid:4152b40e-9ef3-4b46-8557-a7d2dbfa40b9
ExifTool has so many options that it's hard to say what you might need. It can be set to output CSV or to remove duplicate output, but just as a taster, here is one command file and its output (with duplicates).
mymeta.cmd
@echo off
set "exif=C:\Users\lez\Downloads\Apps\PDF\mupdf\1.20.0\exiftool(-a -U -g1 -w! .pdf.xmp.txt).exe"
"%exif%" "%~1" 2>nul
type "%~1.xmp.txt" |find /i "title" >metadata.txt
type "%~1.xmp.txt" |find /i "author" >>metadata.txt
type "%~1.xmp.txt" |find /i "publisher" >>metadata.txt
type "%~1.xmp.txt" |findstr /r /i "^identifier" >>metadata.txt
type "%~1.xmp.txt" |find /i "producer" >>metadata.txt
Thus, to find the DC publisher, you export the line
Publisher : IOP Publishing
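To get from that text file back into Python, the grouped listing can be parsed into a nested dictionary. This is a minimal sketch that assumes the "---- Group ----" headers and "Tag : value" lines shown above; the function name parse_grouped_output is made up, and a repeated tag within a group simply overwrites the earlier value.

```python
import re

# Group headers look like "---- XMP-dc ----"
GROUP_HEADER = re.compile(r"^-{4}\s+(.+?)\s+-{4}$")


def parse_grouped_output(text):
    """Parse grouped 'Tag : value' text into {group: {tag: value}}."""
    groups = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = GROUP_HEADER.match(line)
        if header:
            current = groups.setdefault(header.group(1), {})
            continue
        # Split on the first colon only, since values may contain colons
        tag, sep, value = line.partition(":")
        if sep and current is not None:
            current[tag.strip()] = value.strip()
    return groups


# Sample lines copied from the listing above.
SAMPLE = """---- XMP-dc ----
Format : application/pdf
Identifier : doi:10.1088/0026-1394/52/4/613
Publisher : IOP Publishing
---- XMP-xmp ----
Create Date : 2015:07:30 20:24:16+05:30
"""

meta = parse_grouped_output(SAMPLE)
print(meta["XMP-dc"]["Publisher"])   # IOP Publishing
print(meta["XMP-dc"]["Identifier"])  # doi:10.1088/0026-1394/52/4/613
```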
Upvotes: 0
Reputation: 3417
I don't know xmptools, but maybe pdfminer.six could help?
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

# doc.info holds the document information dictionaries (Title, Author, ...),
# not the XMP packet
with open('example.pdf', 'rb') as fp:
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    print(doc.info)
Upvotes: 0