Extract text from a PDF file

Question

I am trying to extract the text between "[" and "]" in a PDF file, but I am unable to do so because the file seems to be encrypted. What I get back is a run of symbols, not readable text.

import java.io.IOException;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class ITextReadDemo {

    public static void main(String[] args) {
        try {
            // Backslashes in a Java string literal must be escaped.
            PdfReader reader = new PdfReader("D:\\temp\\1.pdf");
            System.out.println("This PDF has " + reader.getNumberOfPages() + " pages.");
            // Extract the text of page 2.
            String page = PdfTextExtractor.getTextFromPage(reader, 2);
            System.out.println("Page Content:\n\n" + page + "\n\n");
            System.out.println("Is this document tampered : " + reader.isTampered());
            System.out.println("Is this document encrypted : " + reader.isEncrypted());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

But I am getting this exception:

Exception in thread "main" java.lang.NoClassDefFoundError: org/bouncycastle/asn1/ASN1OctetString
    at com.itextpdf.text.pdf.PdfEncryption.<init>(PdfEncryption.java:147)
    at com.itextpdf.text.pdf.PdfReader.readDecryptedDocObj(PdfReader.java:775)
    at com.itextpdf.text.pdf.PdfReader.readDocObj(PdfReader.java:1152)
    at com.itextpdf.text.pdf.PdfReader.readPdf(PdfReader.java:512)
    at com.itextpdf.text.pdf.PdfReader.<init>(PdfReader.java:172)
    at com.itextpdf.text.pdf.PdfReader.<init>(PdfReader.java:161)
    at pdfexc.ITextReadDemo.main(ITextReadDemo.java:19)
Caused by: java.lang.ClassNotFoundException: org.bouncycastle.asn1.ASN1OctetString
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    ... 7 more
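
The trace above says that org.bouncycastle.asn1.ASN1OctetString could not be loaded while iText was setting up decryption, which usually points to a jar missing from the runtime classpath. A minimal sketch for checking whether that class is visible at runtime follows. The names ClasspathCheck and isOnClasspath are hypothetical and only for illustration; the class name being looked up is taken from the trace.

```java
class ClasspathCheck {

    // Returns true if the named class can be located by this class's loader.
    // The class is not initialized, only looked up.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name copied from the NoClassDefFoundError in the stack trace.
        String name = "org.bouncycastle.asn1.ASN1OctetString";
        System.out.println(name + " on classpath: " + isOnClasspath(name));
    }
}
```

If this prints false when run with the same classpath as ITextReadDemo, the Bouncy Castle classes are not available to the program.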

I also tried the following. It reads the contents of the PDF file, but when I print them they are not in a readable format.

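    import java.io.IOException;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Scanner;

    void readfile() throws IOException {
        // Backslashes in a Java string literal must be escaped.
        Path path = Paths.get("D:\\temp\\1.pdf");
        try (Scanner scanner = new Scanner(path)) {
            while (scanner.hasNextLine()) {
                String line = scanner.nextLine();
                System.out.println(line);
            }
        }
    }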

All I need is the contents of the PDF file (not a text file) in readable form, so that I can extract the text between [ and ] using a regex. Please help me if you know the solution.
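
For reference, the regex step itself could look roughly like the sketch below, assuming the page text is already available as a readable String. The class name BracketExtractor, the method extract, and the sample input are hypothetical; only java.util.regex from the standard library is used. The pattern matches a "[", captures everything up to the next "]", and does not try to handle nested brackets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BracketExtractor {

    // "[" followed by any run of characters other than "]", then "]".
    // Group 1 holds the text between the brackets.
    private static final Pattern BRACKETED = Pattern.compile("\\[([^\\]]*)\\]");

    // Returns every bracketed segment of the input, in order of appearance.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = BRACKETED.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for the extracted page content.
        String page = "Order [A-17] was shipped to [Berlin].";
        for (String segment : extract(page)) {
            System.out.println(segment);
        }
    }
}
```

In the iText code above, the String returned by PdfTextExtractor.getTextFromPage would be the value passed to extract, once the page text comes back readable.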
