fespinozacast

Reputation: 2494

Pdf Parsing Challenge

I have the following problem: I have a lot of papers in PDF format, and I need to extract information from the first page of each one and save it to a database.

I only need to extract the title, abstract, keywords, author list, university list, and emails. I want to write a script that produces a string for each of those fields, for each paper.

How can I do that? Has anyone already done this? Which languages and tools do you recommend? And is there a paper repository that already feeds a database this way?

The PDFs may use different encodings, so I have to deal with that problem too. Any help with this would be great.

An example of a paper is here

Greetings!

Upvotes: 1

Views: 326

Answers (2)

Luc M

Reputation: 17334

You need an API to read your PDFs.

Seems fine (I have never tried it, though)

You can probably find others with this link :-)

Upvotes: 0

jjchiw

Reputation: 4455

http://pdfbox.apache.org/

You have to check the security settings of the PDF, and that it really contains text and not an image. First check whether the pdfbox command-line application can extract the text; if it can, you can use the jar and http://pdfbox.apache.org/apidocs/org/apache/pdfbox/examples/util/ExtractTextByArea.html

Hope it helps....

By the way, it's Java...
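Once the first page's text has been extracted (with pdfbox or any other tool), the fields still have to be pulled out of it. Below is a rough sketch of how that step might look using only the Java standard library. The class name `FirstPageFields` and the method `extract` are made up for illustration and are not part of pdfbox. It assumes the title is the first non-blank line and that the paper uses literal "Abstract" and "Keywords" (or "Index Terms") headings, which will not hold for every layout. Authors and universities are left out because they usually need layout-specific rules.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: heuristics over already-extracted first-page text.
class FirstPageFields {
    // Simple e-mail pattern; good enough for a heuristic, not a full validator.
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Text after the word "Abstract" up to a "Keywords"/"Index Terms" heading or end of text.
    private static final Pattern ABSTRACT = Pattern.compile(
        "(?is)\\babstract\\b[\\s:.\\-]*(.+?)(?=\\bkey[\\s-]?words\\b|\\bindex terms\\b|\\z)");

    // Rest of the line that follows a "Keywords" or "Index Terms" heading.
    private static final Pattern KEYWORDS = Pattern.compile(
        "(?im)^\\s*(?:key[\\s-]?words|index terms)\\b[\\s:.\\-]*(.+)$");

    // Returns a map with the keys title, abstract, keywords, emails.
    // A field that cannot be found is returned as an empty string.
    static Map<String, String> extract(String pageText) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", firstNonBlankLine(pageText));
        fields.put("abstract", firstGroup(ABSTRACT, pageText));
        fields.put("keywords", firstGroup(KEYWORDS, pageText));

        // Collect distinct e-mail addresses in order of appearance.
        Set<String> emails = new LinkedHashSet<>();
        Matcher m = EMAIL.matcher(pageText);
        while (m.find()) {
            emails.add(m.group());
        }
        fields.put("emails", String.join(", ", emails));
        return fields;
    }

    // Assumption: the title is the first non-blank line of the page.
    private static String firstNonBlankLine(String text) {
        for (String line : text.split("\\R")) {
            if (!line.trim().isEmpty()) {
                return line.trim();
            }
        }
        return "";
    }

    // First capture group of the first match, with whitespace collapsed.
    private static String firstGroup(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1).replaceAll("\\s+", " ").trim() : "";
    }
}
```

Calling `FirstPageFields.extract(text)` on the text of one page gives one string per field, ready to be written to the database. Regex heuristics like these are fragile, so it is probably worth logging the papers where a field comes back empty and checking them by hand.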

Edit: I have not used http://www.qoppa.com/pdftext/ as a jar library, but I tried its example application and it works. In the end I decided to go with pdfbox.

Upvotes: 1
