Reputation: 51100
What is the easiest way to get the text (words) of a PDF file as one long String or an array of Strings?
I have tried PDFBox, but it is not working for me.
Upvotes: 7
Views: 17537
Reputation: 788
Well, I have used Tika to extract raw text from PDFs (it is based on PDFBox), but I think Tika is only useful when you have to extract text from several different file formats (its auto-detection helps a lot).
If you only want to parse PDFs into text, I would suggest PDFTextStream, because it's a much better parser than other APIs (such as iText and PDFBox).
With PDFTextStream you can easily get structured text (pages -> blocks -> lines -> textUnits), and it lets you extract related information such as a character's encoding, height, and location on the page.
Example:
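import java.io.IOException;
// Package names below assume the com.snowtide.pdf layout of PDFTextStream 2.x.
import com.snowtide.pdf.OutputTarget;
import com.snowtide.pdf.PDFTextStream;

public class ExtractTextAllPages {
    public static void main(String[] args) throws IOException {
        String pdfFilePath = args[0];
        PDFTextStream pdfts = new PDFTextStream(pdfFilePath);
        StringBuilder text = new StringBuilder(1024);
        // Pipe the text of every page into the StringBuilder.
        pdfts.pipe(new OutputTarget(text));
        pdfts.close();
        System.out.printf("The text extracted from %s is:%n", pdfFilePath);
        System.out.println(text);
    }
}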
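Since the question also asks for an array of Strings, here is a minimal sketch of turning the extracted text into words. It uses only the JDK and works on any CharSequence, such as the StringBuilder filled above. The class and method names (TextToWords, toWords) are made up for illustration, and splitting on whitespace is an assumption about what counts as a "word".

```java
import java.util.Arrays;

final class TextToWords {

    // Splits extracted text into words on runs of whitespace.
    // Returns an empty array when the text is empty or only whitespace.
    static String[] toWords(CharSequence text) {
        String trimmed = text.toString().trim();
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        // Stand-in for the StringBuilder produced by a PDF text extractor.
        StringBuilder text = new StringBuilder("Hello  PDF\nworld ");
        System.out.println(Arrays.toString(toWords(text))); // prints [Hello, PDF, world]
    }
}
```

Punctuation stays attached to the words with this approach; a stricter pattern would be needed if you want it stripped.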
Upvotes: 0
Reputation: 449
JPedal and Multivalent also offer text extraction in Java, or you could access xpdf using Runtime.exec.
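A rough sketch of the xpdf route, assuming the pdftotext command-line tool that ships with xpdf is installed and on the PATH. It uses ProcessBuilder rather than Runtime.exec, because ProcessBuilder makes it easy to discard stderr. The class and method names (XpdfTextExtractor, buildCommand, extractText) are hypothetical. Passing "-" as the output file asks pdftotext to write to stdout.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

final class XpdfTextExtractor {

    // Builds the pdftotext command line; "-" as the output file means stdout.
    static List<String> buildCommand(String pdfPath) {
        return List.of("pdftotext", "-enc", "UTF-8", pdfPath, "-");
    }

    // Runs pdftotext and returns everything it wrote to stdout as one String.
    static String extractText(String pdfPath) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdfPath))
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        // Read stdout fully before waiting, so a large document cannot fill the pipe and block.
        String text = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractText(args[0]));
    }
}
```

This needs Java 9 or later (List.of, readAllBytes, Redirect.DISCARD) and gives you the whole document as one long String.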
Upvotes: 1
Reputation: 3791
Use iText. The following snippet, for example, extracts the text of page 3:
PdfReader reader = new PdfReader("C:/Text.pdf");
PdfTextExtractor parser = new PdfTextExtractor(reader);
String text = parser.getTextFromPage(3);
Upvotes: 4
Reputation: 10714
PDFBox barfs on many newer PDFs, especially those with embedded PNG images.
I was very impressed with PDFTextStream.
Upvotes: 3