Reputation: 616
i need to extract text from a pdf file using java. I found iText but it doesn't work the way i wanted it to. Here's my code
package com.itextpdf.mavenproject1;
import com.itextpdf.forms.PdfAcroForm;
import com.itextpdf.forms.fields.PdfButtonFormField;
import com.itextpdf.forms.fields.PdfFormField;
import com.itextpdf.io.font.FontConstants;
import com.itextpdf.kernel.font.PdfFontFactory;
import com.itextpdf.kernel.geom.Rectangle;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.kernel.pdf.PdfString;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.kernel.pdf.action.PdfAction;
import com.itextpdf.kernel.pdf.annot.PdfAnnotation;
import com.itextpdf.kernel.pdf.annot.PdfTextAnnotation;
import com.itextpdf.kernel.pdf.canvas.PdfCanvas;
import com.itextpdf.kernel.pdf.canvas.parser.PdfTextExtractor;
import com.itextpdf.test.annotations.WrapToTest;
import java.io.File;
import java.io.IOException;
public class zczytywanie {
public static void main(String args[]) throws IOException {
PdfDocument pdfDoc = new PdfDocument(new PdfReader("D:/pdf/pdf"));
String page= PdfTextExtractor.getTextFromPage(pdfDoc, 1);
System.out.println(page);
}
}
And it tells me that there is an error in the line where i try to use PDdfTextExtractor (PdfDocument can not be converted to pdfPage, although i found that pdfDoc has to be PdfReader)
It doesn't work with
PdfReader pdfDoc = new PdfReader("D:/pdf/pdf");
either.
Upvotes: 0
Views: 1615
Reputation: 147
You can try PDFBox or Tikka. But here I am giving an example for PDFBox
Add the PDFBox jar dependency to your pom.xml.
<dependencies>
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>2.0.23</version>
</dependency>
</dependencies>
Java class
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.PDFTextStripperByArea;
import java.io.File;
import java.io.IOException;
public class TestPDF {
public static void main(String[] args) {
try (PDDocument document = PDDocument.load(new File("/path_to_your_pdf_file"))) {
document.getClass();
if(!document.isEncrypted()){
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.setSortByPosition(true);
PDFTextStripper tStripper = new PDFTextStripper();
String pdfFileInText = tStripper.getText(document);
System.out.println("Text:" + pdfFileInText);
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
Upvotes: 0