Reputation: 15122
I am working on a project, SIGGRAPH Image Wall.
My first challenge is to figure out how to extract the title from each page of a PDF, SIGGRAPH 2013 Technical Papers First Pages (44 MB PDF). This PDF is a compilation of the first page of each paper, so there is a paper title on every page, which makes it a little different from a traditional scholarly paper. Does anyone have any ideas for this?
Upvotes: 2
Views: 4717
Reputation: 997
I think you can accomplish this using any of a number of text extraction approaches, though I will caution that getting to 100% accuracy will be tricky...
Some possible tools to use:
Your source pages look reasonably consistent - I feel like you'll be able to make some smart guesses about where on the page your content will be and what it'll look like. I'd try this out:
If the title font varies, you'll need to guess what the title font is on each page and tell it apart from the author names (the only other content you should get from the top of the page), which you can probably do simply by comparing font sizes.
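To make the font-size comparison concrete, here is a minimal sketch in plain Python. It assumes you have already pulled text spans out of each page with whatever extraction tool you pick, and that each span carries its text, font size, and distance from the top of the page. The `Span` class, the `guess_title` function, and the `top_fraction` and `tolerance` parameters are all hypothetical names for illustration, and the sample data is made up, not taken from the SIGGRAPH PDF.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """One run of text from a page (hypothetical structure for this sketch)."""
    text: str
    size: float  # font size in points
    top: float   # distance from the top edge of the page, in points


def guess_title(spans: List[Span], page_height: float,
                top_fraction: float = 0.4, tolerance: float = 0.5) -> str:
    """Guess the title as the largest-font text near the top of the page."""
    # Only consider non-empty spans in the top part of the page.
    limit = page_height * top_fraction
    top_spans = [s for s in spans if s.top <= limit and s.text.strip()]
    if not top_spans:
        return ""
    # Assumption: the title uses the largest font in that region,
    # so author names (smaller font) drop out here.
    largest = max(s.size for s in top_spans)
    title_spans = [s for s in top_spans if largest - s.size <= tolerance]
    # Join multi-line titles in reading order (top to bottom).
    title_spans.sort(key=lambda s: s.top)
    return " ".join(s.text.strip() for s in title_spans)


# Made-up example page: a two-line title, an author line, and body text.
sample = [
    Span("A Hypothetical Paper Title:", 18.0, 60.0),
    Span("With a Second Line", 18.0, 82.0),
    Span("Jane Doe   John Roe", 11.0, 110.0),
    Span("Abstract", 9.0, 150.0),
    Span("Body text far down the page", 9.0, 500.0),
]
title = guess_title(sample, page_height=792.0)
```

The tolerance is there because extracted font sizes are often slightly off from line to line. You would probably still want to eyeball the results, since a page with a large drop cap or logo near the top would fool this.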
Upvotes: 2