Ankur

Reputation: 51100

From PDF to String

What is the easiest way to get the text (words) of a PDF file as one long String or an array of Strings?

I have tried PDFBox, but it is not working for me.

Upvotes: 7

Views: 17537

Answers (4)

yeaaaahhhh..hamf hamf

Reputation: 788

Well, I have used Tika to extract raw text from PDFs (it is based on PDFBox), but I think Tika is only useful when you have to extract text from several different file formats (its auto-detection helps a lot).

If you only want to parse PDFs into text, I would suggest PDFTextStream, because it is a much better parser than other APIs (such as iText and PDFBox).

With PDFTextStream you can easily get structured text (pages->blocks->lines->textUnits), and it lets you extract related information such as the character encoding, the height, and the location of a character on the page.

Example:

import java.io.IOException;

import com.snowtide.pdf.OutputTarget;
import com.snowtide.pdf.PDFTextStream;

public class ExtractTextAllPages {
    public static void main(String[] args) throws IOException {
        String pdfFilePath = args[0];
        // Open the PDF and pipe the text of every page into one buffer
        PDFTextStream pdfts = new PDFTextStream(pdfFilePath);
        StringBuilder text = new StringBuilder(1024);
        pdfts.pipe(new OutputTarget(text));
        pdfts.close();
        System.out.printf("The text extracted from %s is:%n", pdfFilePath);
        System.out.println(text);
    }
}
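
If you need an array of Strings rather than one long String, as the question asks, the extracted text can be split afterwards. This is only a minimal sketch using the standard library: `TextToWords` and `toWords` are hypothetical names, and splitting on runs of whitespace is an assumption about what counts as a word.

```java
import java.util.Arrays;

class TextToWords {
    // Hypothetical helper: splits extracted text into words on runs of
    // whitespace (spaces, tabs, line breaks). Returns an empty array for
    // blank input.
    static String[] toWords(CharSequence text) {
        String trimmed = text.toString().trim();
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        // Stand-in for the StringBuilder filled by a PDF text extractor
        StringBuilder text = new StringBuilder("From PDF\n to   String");
        System.out.println(Arrays.toString(toWords(text)));
    }
}
```

Because `toWords` accepts any `CharSequence`, the `StringBuilder` filled by an extractor can be passed in directly.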

Upvotes: 0

mark stephens

Reputation: 449

JPedal and Multivalent also offer text extraction in Java, or you could call xpdf using Runtime.exec.
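
A rough sketch of the xpdf route, assuming the `pdftotext` tool from xpdf is installed and on the PATH. It uses `ProcessBuilder` instead of `Runtime.exec` (same idea, but easier to handle stderr), and the names `XpdfText`, `buildCommand` and `extract` are made up for illustration. It needs Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

class XpdfText {
    // Hypothetical helper: builds the pdftotext command line.
    // "-" as the output file tells pdftotext to write to stdout.
    static List<String> buildCommand(String pdfPath) {
        return List.of("pdftotext", "-enc", "UTF-8", pdfPath, "-");
    }

    // Runs pdftotext and returns everything it wrote to stdout.
    static String extract(String pdfPath) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdfPath))
                // Discard stderr so a full error pipe cannot block the child
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        String text;
        try (InputStream out = process.getInputStream()) {
            text = new String(out.readAllBytes(), StandardCharsets.UTF_8);
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: XpdfText <file.pdf>");
            return;
        }
        System.out.println(extract(args[0]));
    }
}
```

Passing the arguments as a list avoids quoting problems with file names that contain spaces.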

Upvotes: 1

Kushal Paudyal

Reputation: 3791

Use iText. The following snippet, for example, extracts the text of page 3.

PdfReader reader = new PdfReader("C:/Text.pdf");
PdfTextExtractor parser = new PdfTextExtractor(reader);
String text = parser.getTextFromPage(3); // keep the result; page numbers start at 1

Upvotes: 4

Sam Barnum

Reputation: 10714

PDFBox barfs on many newer PDFs, especially those with embedded PNG images.

I was very impressed with PDFTextStream.

Upvotes: 3
