msh855

Reputation: 1571

Extracting the keywords from PDF metadata in Python

I have a PDF file and want to extract some information from its metadata. To do so, I use the following procedure:

from PyPDF2 import PdfFileReader    
mypath = "your_pdf_file.pdf"
pdf_toread = PdfFileReader(open(mypath, 'rb'))
pdf_info = pdf_toread.getDocumentInfo()

For the document at hand the output is:

Out[230]: 
{'/CrossmarkDomainExclusive': 'true',
 '/CreationDate': "D:20181029074117+05'30'",
 '/CrossMarkDomains#5B2#5D': 'elsevier.com',
 '/Author': 'Nicola Gennaioli',
 '/Creator': 'Elsevier',
 '/ElsevierWebPDFSpecifications': '6.5',
 '/Subject': 'Journal of Monetary Economics, 98 (2018) 98-113. doi:10.1016/j.jmoneco.2018.04.011',
 '/CrossmarkMajorVersionDate': '2010-04-23',
 '/CrossMarkDomains#5B1#5D': 'sciencedirect.com',
 '/robots': 'noindex',
 '/ModDate': "D:20181029074135+05'30'",
 '/AuthoritativeDomain#5B1#5D': 'sciencedirect.com',
 '/Keywords': 'Sovereign Risk; Sovereign Default; Government Bonds',
 '/doi': '10.1016/j.jmoneco.2018.04.011',
 '/Title': 'Banks, government Bonds, and Default: What do the data Say?',
 '/AuthoritativeDomain#5B2#5D': 'elsevier.com',
 '/Producer': 'Acrobat Distiller 10.1.10 (Windows)'}

I found out, however, that the PyPDF2 library does not have an attribute to access the /Keywords entry. That is, this bit of the output:

'/Keywords': 'Sovereign Risk; Sovereign Default; Government Bonds',

So, I would like some help getting this part of the metadata output [in this example: Sovereign Risk; Sovereign Default; Government Bonds].

To reproduce the output I am sharing a link to the document

Update:

Doing, for example:

print(pdf_info.title)
Banks, government Bonds, and Default: What do the data Say?

print(pdf_info.subject)
Journal of Monetary Economics, 98 (2018) 98-113. doi:10.1016/j.jmoneco.2018.04.011

But when I try to do something similar for the /Keywords part, I get the following attribute error:

pdf_info.keywords
Traceback (most recent call last):

  File "<ipython-input-295-3852401ef983>", line 1, in <module>
    pdf_info.keywords

AttributeError: 'DocumentInformation' object has no attribute 'keywords'

Upvotes: 3

Views: 5761

Answers (1)

Jongware

Reputation: 22457

The key /Keywords is actually present in the dictionary returned by getDocumentInfo, so you don't have to do anything special (except first testing whether it is there, or wrapping the lookup in a try, in case it is not present in another file):

from PyPDF2 import PdfFileReader    
mypath = "../Downloads/banks_gov_bonds_default.pdf"
pdf_toread = PdfFileReader(open(mypath, 'rb'))
pdf_info = pdf_toread.getDocumentInfo()
if '/Keywords' in pdf_info:
    print(pdf_info['/Keywords'])

prints

Sovereign Risk; Sovereign Default; Government Bonds

which indeed are the keywords in the field inside your sample PDF.
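If you need the individual keywords rather than the whole string, here is a minimal sketch. It assumes the metadata object behaves like a plain dict (which the membership test above relies on) and that the keywords are separated by semicolons, as in this file; other PDFs may use commas or something else. The helper name extract_keywords and the sample dict are made up for illustration, they are not part of PyPDF2.

```python
def extract_keywords(info, sep=";"):
    """Return the /Keywords entry of a dict-like metadata object as a list.

    `info` is assumed to be dict-like (e.g. what getDocumentInfo returns);
    `sep` is an assumption about how this particular PDF separates keywords.
    """
    # .get returns None when the key is missing, so no KeyError is raised
    raw = info.get("/Keywords") if info else None
    if not raw:
        return []
    # Split on the separator and drop surrounding whitespace and empty parts
    return [kw.strip() for kw in str(raw).split(sep) if kw.strip()]


# Stand-in for the metadata of the sample PDF (values copied from the question)
sample_info = {"/Keywords": "Sovereign Risk; Sovereign Default; Government Bonds"}
print(extract_keywords(sample_info))
```

With a real file you would pass pdf_info instead of sample_info; a file without a /Keywords entry simply gives an empty list.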

The other option is to add keywords to the exposed PDF properties by editing pdf.py inside the PyPDF2 folder where pip placed it. You can find the creation of the title, creator, author and some more properties in the class DocumentInformation, somewhere around line 2781 in my version. The creation of all of these properties follows a simple scheme, so adding one is no problem at all:

keywords = property(lambda self: self.getText("/Keywords"))
"""Read-only property accessing the document's **keywords**.
Returns a unicode string (``TextStringObject``)
or ``None`` if the keywords are not specified."""
keywords_raw = property(lambda self: self.get("/Keywords"))
"""The "raw" version of keywords; can return a ``ByteStringObject``."""

(I added keywords_raw only because the other properties did so as well. I can't tell off-hand what these are for, though.)

After that your code actually works:

from PyPDF2 import PdfFileReader    
mypath = "../Downloads/banks_gov_bonds_default.pdf"
pdf_toread = PdfFileReader(open(mypath, 'rb'))
pdf_info = pdf_toread.getDocumentInfo()
print(pdf_info.keywords)

Result, again:

Sovereign Risk; Sovereign Default; Government Bonds
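Keep in mind that an edit to a file inside the installed package is lost when the package is upgraded or reinstalled. A possible alternative is to attach the same property to the class at runtime from your own script. The sketch below only shows the mechanism on a stand-in class, so that it runs without PyPDF2 installed: StandInDocumentInformation and its simplified getText are hypothetical, not the library's code. With the real library you would assign the property to its DocumentInformation class instead; where that class is imported from depends on your PyPDF2 version, so check that first.

```python
class StandInDocumentInformation(dict):
    """Hypothetical stand-in for PyPDF2's DocumentInformation (dict-like)."""

    def getText(self, key):
        # Simplified assumption: return the value only if it is a string,
        # otherwise None (also covers a missing key)
        value = self.get(key)
        return value if isinstance(value, str) else None


# Attach the property after the class exists, using the same lambda as above;
# no file inside the installed package is touched
StandInDocumentInformation.keywords = property(
    lambda self: self.getText("/Keywords")
)

info = StandInDocumentInformation(
    {"/Keywords": "Sovereign Risk; Sovereign Default; Government Bonds"}
)
print(info.keywords)
```

A property assigned this way applies to every instance of the class, including ones created before the assignment, because properties are looked up on the class.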

Upvotes: 3
