PDF Parsing Challenge

Question

I have the following problem: I have a large collection of papers in PDF format, and I need to extract information from the first page of each one and save it into a database.

I only need to extract the title, the abstract, the keywords, the author list, the university list, and the emails. I want to write a script that produces one string for each of those fields, for each paper.

How can I do that? Has anyone already done this? Which languages and tools do you recommend? And is there an existing paper repository that already feeds a database like this?

The PDFs may use different encodings, so I have to deal with that problem too. Any help with this would be great.

An example of a paper is here.

Greetings!

jjchiw · Accepted Answer

http://pdfbox.apache.org/

You have to check the security settings of the PDF, and that its content is really text and not an image. First try the PDFBox command-line application to see whether it can extract the text. If it can, use the jar in your own code and look at the ExtractTextByArea example: http://pdfbox.apache.org/apidocs/org/apache/pdfbox/examples/util/ExtractTextByArea.html

Hope it helps....

By the way, it's Java...
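Once the first page is available as plain text, the fields still have to be picked out of it. Below is a minimal sketch of that step using only the Java standard library. It is an assumption-heavy illustration, not part of PDFBox: the class name FirstPageFields, its method names, and the heading patterns ("Abstract", "Keywords", "Index Terms", "1 Introduction") are all hypothetical, and the title rule (first non-blank line) is only a crude guess. Authors and universities are not handled here, because they usually need layout information or per-publisher rules.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: heuristics for pulling fields out of already-extracted first-page text.
class FirstPageFields {
    // Deliberately loose e-mail pattern: fine for harvesting, not for validation.
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Text after an "Abstract" heading, up to a keywords heading, "1 Introduction", or end of text.
    private static final Pattern ABSTRACT = Pattern.compile(
        "(?is)\\babstract\\b[\\s.:\\-]*(.+?)"
        + "(?=\\bkey\\s?words\\b|\\bindex terms\\b|\\b1\\.?\\s+introduction\\b|\\z)");

    // Rest of the line that follows a "Keywords" or "Index Terms" heading.
    private static final Pattern KEYWORDS = Pattern.compile(
        "(?im)^\\s*(?:key\\s?words|index terms)\\b[\\s.:\\-]*(.+)$");

    // Assumption: the title is the first non-blank line of the page.
    static String title(String page) {
        for (String line : page.split("\\R")) {
            if (!line.trim().isEmpty()) {
                return line.trim();
            }
        }
        return "";
    }

    // All distinct e-mail addresses, in order of first appearance.
    static List<String> emails(String page) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = EMAIL.matcher(page);
        while (m.find()) {
            found.add(m.group());
        }
        return new ArrayList<>(found);
    }

    // Abstract body with line breaks collapsed to single spaces, or "" if no heading is found.
    static String abstractText(String page) {
        Matcher m = ABSTRACT.matcher(page);
        return m.find() ? m.group(1).replaceAll("\\s+", " ").trim() : "";
    }

    // Keywords split on commas or semicolons, or an empty list if no heading is found.
    static List<String> keywords(String page) {
        List<String> out = new ArrayList<>();
        Matcher m = KEYWORDS.matcher(page);
        if (m.find()) {
            for (String k : m.group(1).split("[,;]")) {
                if (!k.trim().isEmpty()) {
                    out.add(k.trim());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for one extracted first page.
        String page = String.join("\n",
            "A Study of Things",
            "Alice Smith, Bob Jones",
            "alice@example.edu, bob@example.org",
            "Abstract",
            "We study things",
            "in detail.",
            "Keywords: things, study",
            "1 Introduction");
        System.out.println("title:    " + title(page));
        System.out.println("emails:   " + emails(page));
        System.out.println("abstract: " + abstractText(page));
        System.out.println("keywords: " + keywords(page));
    }
}
```

Expect to tune the patterns per journal or conference template, since heading names and their order differ between publishers.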

Edit: I have not used http://www.qoppa.com/pdftext/ as a jar library, but I tried its example application and it works. In the end I decided to go with PDFBox...
