Reputation: 1571
I have a PDF file and want to extract some information from its metadata. To do so, I use the following procedure:
from PyPDF2 import PdfFileReader
mypath = "your_pdf_file.pdf"
pdf_toread = PdfFileReader(open(mypath, 'rb'))
pdf_info = pdf_toread.getDocumentInfo()
For the document at hand the output is:
Out[230]:
{'/CrossmarkDomainExclusive': 'true',
'/CreationDate': "D:20181029074117+05'30'",
'/CrossMarkDomains#5B2#5D': 'elsevier.com',
'/Author': 'Nicola Gennaioli',
'/Creator': 'Elsevier',
'/ElsevierWebPDFSpecifications': '6.5',
'/Subject': 'Journal of Monetary Economics, 98 (2018) 98-113. doi:10.1016/j.jmoneco.2018.04.011',
'/CrossmarkMajorVersionDate': '2010-04-23',
'/CrossMarkDomains#5B1#5D': 'sciencedirect.com',
'/robots': 'noindex',
'/ModDate': "D:20181029074135+05'30'",
'/AuthoritativeDomain#5B1#5D': 'sciencedirect.com',
'/Keywords': 'Sovereign Risk; Sovereign Default; Government Bonds',
'/doi': '10.1016/j.jmoneco.2018.04.011',
'/Title': 'Banks, government Bonds, and Default: What do the data Say?',
'/AuthoritativeDomain#5B2#5D': 'elsevier.com',
'/Producer': 'Acrobat Distiller 10.1.10 (Windows)'}
I found out, however, that the PyPDF2 library does not have an attribute to "access" the information for the /Keywords part. That is, this bit of output:
'/Keywords': 'Sovereign Risk; Sovereign Default; Government Bonds',
So, I would like some help on how to get this part of the metadata output [in this example: Sovereign Risk; Sovereign Default; Government Bonds].
To reproduce the output I am sharing a link to the document
Update:
Doing, for example:
print(pdf_info.title)
Banks, government Bonds, and Default: What do the data Say?
print(pdf_info.subject)
Journal of Monetary Economics, 98 (2018) 98-113. doi:10.1016/j.jmoneco.2018.04.011
But when I try to do something similar for the /Keywords part, I get the following attribute error:
pdf_info.keywords
Traceback (most recent call last):
File "<ipython-input-295-3852401ef983>", line 1, in <module>
pdf_info.keywords
AttributeError: 'DocumentInformation' object has no attribute 'keywords'
Upvotes: 3
Views: 5761
Reputation: 22457
The key /Keywords is actually present in the dictionary returned by getDocumentInfo, so you don't have to do anything special (except first testing whether it is there, or wrapping the lookup in a try, in case it is not present in another file):
from PyPDF2 import PdfFileReader
mypath = "../Downloads/banks_gov_bonds_default.pdf"
pdf_toread = PdfFileReader(open(mypath, 'rb'))
pdf_info = pdf_toread.getDocumentInfo()
if '/Keywords' in pdf_info:
    print(pdf_info['/Keywords'])
prints
Sovereign Risk; Sovereign Default; Government Bonds
which indeed are the keywords in the field inside your sample PDF.
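If the keywords are needed as separate items rather than one string, the dictionary lookup above can be wrapped in a small helper. This is only a sketch: the function name `extract_keywords` is made up for illustration, and the semicolon separator is an assumption based on this sample document (other PDFs may use commas or something else). A plain dict stands in for the result of getDocumentInfo so the example runs without a PDF file.

```python
def extract_keywords(info, separator=";"):
    """Return the /Keywords entry of a metadata mapping as a list of strings.

    `info` is any dict-like object, e.g. the result of getDocumentInfo().
    The default separator is an assumption taken from the sample document.
    """
    raw = info.get("/Keywords")  # None when the key is absent
    if not raw:
        return []
    # Split on the separator and drop surrounding whitespace and empty parts.
    return [part.strip() for part in str(raw).split(separator) if part.strip()]


# Stand-in for the dictionary shown in the question (not read from a real PDF).
sample_info = {
    "/Title": "Banks, government Bonds, and Default: What do the data Say?",
    "/Keywords": "Sovereign Risk; Sovereign Default; Government Bonds",
}

print(extract_keywords(sample_info))
print(extract_keywords({"/Title": "No keywords here"}))
```

Using `.get` avoids both the explicit membership test and the try, since a missing key simply yields an empty list.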
The other option is to add keywords to the exposed PDF properties by editing pdf.py inside the PyPDF2 folder where pip placed it. You can find the creation of the title, creator, author and some more properties in the class DocumentInformation, somewhere around line 2781 in my version. The creation of all of these properties follows a simple scheme, so adding one is no problem at all:
keywords = property(lambda self: self.getText("/Keywords"))
"""Read-only property accessing the document's **keywords**.
Returns a unicode string (``TextStringObject``)
or ``None`` if the keywords are not specified."""
keywords_raw = property(lambda self: self.get("/Keywords"))
"""The "raw" version of keywords; can return a ``ByteStringObject``."""
(I added keywords_raw only because the other properties did so as well. I can't tell off-hand what these are for, though.)
After that your code actually works:
from PyPDF2 import PdfFileReader
mypath = "../Downloads/banks_gov_bonds_default.pdf"
pdf_toread = PdfFileReader(open(mypath, 'rb'))
pdf_info = pdf_toread.getDocumentInfo()
print (pdf_info.keywords)
Result, again:
Sovereign Risk; Sovereign Default; Government Bonds
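Edits made inside an installed package are lost whenever pip reinstalls or upgrades it, so a possible variation is to attach the same property from your own code at runtime. The sketch below is not part of PyPDF2: `add_text_property` and `FakeDocumentInformation` are hypothetical names, and the stand-in class only imitates the two things the property relies on (dict storage and a getText method), so that the example runs without the library installed.

```python
def add_text_property(cls, attr_name, pdf_key):
    """Attach a read-only property to cls that returns self.getText(pdf_key)."""
    setattr(cls, attr_name, property(lambda self: self.getText(pdf_key)))


# Hypothetical stand-in for DocumentInformation: a dict with a getText() method
# that returns the stored string, or None when the key is missing.
class FakeDocumentInformation(dict):
    def getText(self, key):
        value = self.get(key)
        return value if isinstance(value, str) else None


add_text_property(FakeDocumentInformation, "keywords", "/Keywords")

info = FakeDocumentInformation(
    {"/Keywords": "Sovereign Risk; Sovereign Default; Government Bonds"}
)
print(info.keywords)

# Assumption: with the PyPDF2 version described above (class defined in pdf.py),
# the equivalent call would presumably be:
#   from PyPDF2.pdf import DocumentInformation
#   add_text_property(DocumentInformation, "keywords", "/Keywords")
```

The call has to run once before any keywords attribute is accessed, for example right after the imports.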
Upvotes: 3