Extract titles from each page of a PDF?

Question

I am working on a project, SIGGRAPH Image Wall.

My first challenge is figuring out how to extract the title of each page in a PDF, SIGGRAPH 2013 Technical Papers First Pages (44 MB PDF). This PDF is a compilation of the first page of each paper, so every page carries a paper title, which makes it a little different from a traditional scholarly paper. Does anyone have an idea how to do this?

uptownnickbrown · Accepted Answer

I think you can accomplish this using any of a number of text extraction approaches, though I will caution that getting to 100% accuracy will be tricky...

Some possible tools to use:

pdftotext or pdf2txt - Simple and easy cross-platform extraction utilities.
PDFNet - Robust SDK for digging into PDFs and pulling out exactly the data you want.
Perl modules: PDF::API2, CAM::PDF - I'm a Perl guy so I'd go this route, but I'm sure similar libraries exist in Python, Ruby, etc.

Your source pages look reasonably consistent - I feel like you'll be able to make some smart guesses about where on the page your content will be and what it'll look like. I'd try this out:

Inspect the PDF manually to figure out the title font name and size.
Extract text information for the top portion of the page (something like the top 150 points; PDF coordinates are measured in points of 1/72 inch, not pixels). Make sure to extract font info.
This should get all of your title text and maybe some author names. Parse this data (either within the script you write, or in the XML output files from pdftotext, etc.), keeping only the words that match your title font info.
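
To make the filtering step concrete, here is a minimal Python sketch. It does not read a PDF itself: it assumes your extraction tool has already given you, for each piece of text, its font name, font size, and distance from the top of the page. The `Span` record, the `title_from_spans` function, and the sample font name and size are all made up for illustration, and the spans are assumed to arrive in reading order.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """One run of text as reported by some PDF extractor (hypothetical shape)."""
    text: str
    font: str    # font name, e.g. found by inspecting the PDF manually
    size: float  # font size in points
    top: float   # distance from the top edge of the page, in points


def title_from_spans(spans, title_font, title_size, max_top=150.0, tol=0.5):
    """Join the spans in the top region that match the known title font.

    Assumes `spans` are already in reading order.
    """
    parts = [
        s.text.strip()
        for s in spans
        if s.top <= max_top                     # only the top portion of the page
        and s.font == title_font                # same font name as the title
        and abs(s.size - title_size) <= tol     # same size, allowing rounding noise
    ]
    return " ".join(p for p in parts if p)


# Made-up data standing in for one extracted page.
page = [
    Span("Fancy Rendering", "Helvetica-Bold", 18.0, 60.0),
    Span("of Clouds", "Helvetica-Bold", 18.0, 82.0),
    Span("A. Author, B. Author", "Helvetica", 11.0, 110.0),
    Span("1 Introduction", "Helvetica-Bold", 18.0, 400.0),
]
title = title_from_spans(page, "Helvetica-Bold", 18.0)
```

The small tolerance on the size comparison is there because extractors often report sizes such as 17.93 rather than a round 18.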

If the title font varies, you'll need to guess what the title font is for each page and differentiate it from author names (the only other content you should get from the top of the page) which you can probably do simply by comparing font sizes.
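
If you have to guess per page, one simple heuristic is to treat the largest font size in the top region as the title size. The sketch below assumes exactly that, which is my assumption and will fail on pages where something else (a banner, a logo caption) is larger than the title. As before, `Span` and `guess_title` are hypothetical names, and the input is assumed to be already extracted and in reading order.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """One run of text from some PDF extractor (hypothetical shape)."""
    text: str
    size: float  # font size in points
    top: float   # distance from the top edge of the page, in points


def guess_title(spans: List[Span], max_top: float = 150.0, tol: float = 0.5) -> str:
    """Guess the title as the largest-font text near the top of the page."""
    # Keep non-empty spans from the top portion of the page only.
    region = [s for s in spans if s.top <= max_top and s.text.strip()]
    if not region:
        return ""
    # Assumption: the title uses the largest font in this region.
    largest = max(s.size for s in region)
    return " ".join(s.text.strip() for s in region if largest - s.size <= tol)


# Made-up data: two pages whose titles use different font sizes.
page_a = [
    Span("Fancy Rendering", 18.0, 60.0),
    Span("of Clouds", 18.0, 82.0),
    Span("A. Author, B. Author", 11.0, 110.0),
]
page_b = [
    Span("Simulating Cloth", 16.0, 55.0),
    Span("C. Author", 10.0, 90.0),
    Span("Abstract", 24.0, 300.0),  # larger, but below the top region
]
titles = [guess_title(p) for p in (page_a, page_b)]
```

Author names drop out because they are set smaller than the title, which is the font-size comparison described above.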
