Finding the title of a PDF with Python

Question

I have a PDF file, and I would like to extract its title into a string. By title I don't mean the title stored in the metadata, but the actual title written in the document. For example, from here I'd like to get "Official SAT® Practice Test 2014-15".

Is there any way to accomplish this?

Brad Ruderman · Accepted Answer

I would take a look at PDFMiner. It lets you load a PDF programmatically, after which you will need some kind of analysis to locate the title. You could try taking the text from the start of the first page up to the first line break, or use a more algorithmic approach. I recommend collecting a large set of PDFs whose titles you already know and running your program against them to check whether it detects the title correctly. Once it does, you can use the same code on the PDFs whose titles you don't know. This technique is commonly referred to as using a training set.
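As a rough sketch of one such "algorithmic approach" (my own assumption, not something PDFMiner does for you): treat the text set in the largest font on the first page as the title, then measure how often that guess matches the known titles. The helper names `first_page_lines`, `pick_title` and `accuracy`, and the `tolerance` parameter, are made up for illustration. The PDF-reading part assumes the pdfminer.six package is installed.

```python
from statistics import mean


def first_page_lines(pdf_path):
    """Return a list of (text, average font size) for each text line on page 1."""
    # Imported here so the pure-Python helpers below work without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    lines = []
    for page in extract_pages(pdf_path, maxpages=1):
        for box in page:
            if not isinstance(box, LTTextContainer):
                continue  # skip figures, rectangles, curves, ...
            for line in box:
                if not isinstance(line, LTTextLine):
                    continue
                sizes = [ch.size for ch in line if isinstance(ch, LTChar)]
                text = line.get_text().strip()
                if text and sizes:
                    lines.append((text, mean(sizes)))
    return lines


def pick_title(lines, tolerance=0.5):
    """Heuristic (an assumption): the title is the first run of consecutive
    lines whose font size is within `tolerance` of the largest size seen."""
    if not lines:
        return None
    largest = max(size for _, size in lines)
    title_parts = []
    for text, size in lines:
        if largest - size <= tolerance:
            title_parts.append(text)
        elif title_parts:
            break  # a smaller line after the title has started ends the title
    return " ".join(title_parts)


def accuracy(guess_title, labeled):
    """Fraction of (lines, expected_title) pairs that guess_title gets right."""
    hits = sum(1 for lines, expected in labeled if guess_title(lines) == expected)
    return hits / len(labeled)
```

For a real file you would call `pick_title(first_page_lines("some.pdf"))`, and build `labeled` from the PDFs whose titles you already know. If the accuracy is poor, that is the cue to try a different heuristic (position on the page, bold fonts, and so on) and measure again.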
