Raj

Reputation: 19

Extract text from a PDF file

I am trying to extract the text between "[" and "]" in a PDF file, but I am unable to do so because the file seems to be encrypted. I am getting symbols that are not in a readable format.

import java.io.IOException;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class ITextReadDemo {

    public static void main(String[] args) {
        try {
            PdfReader reader = new PdfReader("D:\\temp\\1.pdf");
            System.out.println("This PDF has " + reader.getNumberOfPages() + " pages.");
            // Extract the text of page 2 only
            String page = PdfTextExtractor.getTextFromPage(reader, 2);
            System.out.println("Page Content:\n\n" + page + "\n\n");
            System.out.println("Is this document tampered : " + reader.isTampered());
            System.out.println("Is this document encrypted : " + reader.isEncrypted());
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

But I am getting this exception:

Exception in thread "main" java.lang.NoClassDefFoundError: org/bouncycastle/asn1/ASN1OctetString
    at com.itextpdf.text.pdf.PdfEncryption.<init>(PdfEncryption.java:147)
    at com.itextpdf.text.pdf.PdfReader.readDecryptedDocObj(PdfReader.java:775)
    at com.itextpdf.text.pdf.PdfReader.readDocObj(PdfReader.java:1152)
    at com.itextpdf.text.pdf.PdfReader.readPdf(PdfReader.java:512)
    at com.itextpdf.text.pdf.PdfReader.<init>(PdfReader.java:172)
    at com.itextpdf.text.pdf.PdfReader.<init>(PdfReader.java:161)
    at pdfexc.ITextReadDemo.main(ITextReadDemo.java:19)
Caused by: java.lang.ClassNotFoundException: org.bouncycastle.asn1.ASN1OctetString
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    ... 7 more

I also tried the following. It reads the contents of the PDF file, but when I display them, they are not in a readable format:

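    // Reads the raw bytes of the PDF as if it were plain text; PDF content
    // is usually compressed (and here encrypted), so the output is not readable.
    void readfile() throws IOException {
        Path path = Paths.get("D:\\temp\\1.pdf");
        try (Scanner scanner = new Scanner(path)) {
            while (scanner.hasNextLine()) {
                String line = scanner.nextLine();
                System.out.println(line);
            }
        }
    }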

All I need is the contents of the PDF file (not a text file) in a readable format, so that I can extract the text between [ and ] using a regex. Please help if you know the solution.

Upvotes: 0

Views: 362

Answers (1)

mkl

Reputation: 96009

The cause of your problems is already described by the exception:

Exception in thread "main" java.lang.NoClassDefFoundError: org/bouncycastle/asn1/ASN1OctetString

iText uses the BouncyCastle library for security-related tasks like encryption and signing, and you seem not to have that library on your classpath, or at least not the required version of it.
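To confirm this, you could check whether the class named in the exception can be loaded at all. The following is only a minimal sketch using the JDK alone; the class name ClasspathCheck and the method isOnClasspath are made up for illustration and are not part of iText or BouncyCastle.

```java
// Hypothetical helper (not part of iText or BouncyCastle): reports whether
// a class with the given fully qualified name can be found by the class loader.
class ClasspathCheck {

    static boolean isOnClasspath(String className) {
        try {
            // 'false' = locate the class without running its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Class is missing, or present but not loadable (e.g. broken dependencies)
            return false;
        }
    }
}
```

Calling ClasspathCheck.isOnClasspath("org.bouncycastle.asn1.ASN1OctetString") from the same application should return false for as long as the BouncyCastle jar is missing, and true once a jar containing that class has been added to the classpath.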

Unfortunately, you don't say which iText version you use, so I cannot tell which BouncyCastle version is required.
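Once the text extraction itself works, pulling out the text between [ and ] from the extracted page string is a plain regex task. A rough sketch, assuming the brackets are not nested and that each pair sits within the text of a single page; BracketExtractor and extractBracketed are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects every piece of text enclosed in [ and ].
class BracketExtractor {

    // "[", then a captured run of characters other than "]", then "]"
    private static final Pattern BRACKETED = Pattern.compile("\\[([^\\]]*)\\]");

    static List<String> extractBracketed(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = BRACKETED.matcher(text);
        while (matcher.find()) {
            // group(1) is the text between the brackets, without the brackets
            result.add(matcher.group(1));
        }
        return result;
    }
}
```

The string returned by PdfTextExtractor.getTextFromPage in your code could be passed straight to extractBracketed; an input without any brackets yields an empty list.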

Upvotes: 1
