Reputation: 2494
I have a lot of papers in PDF format, and I need to extract information from the first page of each one and save it into a database.
I only need to extract the title, the abstract, the keywords, the list of authors, the list of universities, and the emails. I want to write a script that returns one string for each of those fields, for each paper.
How can I do that? Has anyone already done this? Which languages and tools do you recommend? And is there a paper repository that already feeds a database this way?
The PDFs may use different encodings, so I have to deal with that problem too. Any help with this would be great.
An example of a paper is here
Greetings!
Upvotes: 1
Views: 326
Reputation: 17334
You need an API to read your PDF.
It seems fine (I have never tried it, though)
You can probably find others with this link :-)
Upvotes: 0
Reputation: 4455
First check the security settings of the PDF, and check that it really contains text and not a scanned image. Try the PDFBox command-line application to see whether it can extract the text; if it can, you can use the jar and start from the ExtractTextByArea example: http://pdfbox.apache.org/apidocs/org/apache/pdfbox/examples/util/ExtractTextByArea.html
Hope it helps.
By the way, it's Java.
Edit: I have not used http://www.qoppa.com/pdftext/ as a jar library, but I tried its example application and it works. In the end I decided to go with PDFBox.
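Once the first page is available as plain text (from PDFBox or any other extractor), a rough way to pull out some of the fields is a handful of regular expressions. The following is only a sketch under assumptions: the class name `FirstPageFields` and its methods are made up for illustration, the title is guessed to be the first non-blank line, and the abstract and keywords are assumed to be introduced by the literal headings "Abstract" and "Keywords". Real papers vary a lot, and authors and universities depend too heavily on layout to be attempted here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: works on text that has ALREADY been extracted from page 1.
final class FirstPageFields {
    // Deliberately loose email pattern; it is a heuristic, not a validator.
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    // Text after an "Abstract" heading, up to a "Keywords" heading or end of input.
    private static final Pattern ABSTRACT = Pattern.compile(
        "(?is)\\babstract\\b[\\s:.\\-]*(.+?)(?=\\bkey\\s?words\\b|\\z)");
    // Rest of the line that follows a "Keywords" (or "Key words") heading.
    private static final Pattern KEYWORDS = Pattern.compile(
        "(?i)\\bkey\\s?words\\b[\\s:.\\-]*([^\\r\\n]+)");

    // Returns one string per field; a field that is not found is an empty string.
    static Map<String, String> extract(String pageText) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", firstNonBlankLine(pageText)); // assumption: title comes first
        fields.put("abstract", firstGroup(ABSTRACT, pageText));
        fields.put("keywords", firstGroup(KEYWORDS, pageText));
        List<String> emails = new ArrayList<>();
        Matcher m = EMAIL.matcher(pageText);
        while (m.find()) {
            emails.add(m.group());
        }
        fields.put("emails", String.join(", ", emails));
        return fields;
    }

    private static String firstNonBlankLine(String text) {
        for (String line : text.split("\\R")) {
            if (!line.trim().isEmpty()) {
                return line.trim();
            }
        }
        return "";
    }

    private static String firstGroup(Pattern p, String text) {
        Matcher m = p.matcher(text);
        // Collapse line breaks and runs of spaces left over from the PDF layout.
        return m.find() ? m.group(1).trim().replaceAll("\\s+", " ") : "";
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for an extracted first page.
        String page = "A Study of Widgets\nJane Doe, John Roe\nExample University\n"
            + "jane@example.edu, john@example.edu\n\n"
            + "Abstract: We study widgets\nin detail.\n\n"
            + "Keywords: widgets, gadgets\n1 Introduction\n";
        extract(page).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Each value in the returned map can then be written to its own database column. If the extractor hands back garbled characters for some PDFs, that has to be fixed at the extraction step; these patterns only see whatever text they are given.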
Upvotes: 1