user2638084

Reputation: 301

PDFBox header version info error

I used PDFBox to parse a PDF document. It throws an exception saying that it cannot find the header version info. Any ideas?

I think the version is 1.3; I saw it when I cast every byte to char. The link is http://www.selab.isti.cnr.it/ws-mate/example.pdf

Here is the code of the method and its output:

    public String PDFtest(String textLink) throws IOException {
        PDFParser parser;
        String parsedText = null;
        PDFTextStripper pdfStripper;
        PDDocument pdDoc;
        COSDocument cosDoc;
        PDDocumentInformation pdDocInfo;

        StringBuilder sd = new StringBuilder();
        URL link;
        try {
            link = new URL(textLink);
            URLConnection urlConn = link.openConnection();
            BufferedInputStream in = null;
            in = new BufferedInputStream(urlConn.getInputStream());
            byte data[] = new byte[1024];
            in.read(data, 0, 1024);

            parser = new PDFParser(in);
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            parsedText = pdfStripper.getText(pdDoc);
        } catch (MalformedURLException ex) {
            Logger.getLogger(HTMLhelper.class.getName()).log(Level.SEVERE, null, ex);
        } catch (NumberFormatException e) {
            System.out.println("hata");
        }

        return parsedText;
    }

Exception:

Exception in thread "main" java.io.IOException: Error: Header doesn't contain versioninfo
    at org.apache.pdfbox.pdfparser.PDFParser.parseHeader(PDFParser.java:317)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:173)
    at ParsingMachine.HTMLhelper.PDFtest(HTMLhelper.java:99)
    at ParsingMachine.tester.main(tester.java:18)
Java Result: 1

Upvotes: 7

Views: 27327

Answers (4)

user176692

Reputation: 830

The folder being parsed was outdated. It contained no PDFs, so the only file left to pick up was Thumbs.db. I remember specifically skipping that file, but apparently the skip did not take effect when the folder was otherwise empty.

Updating the directory fixed it.

(A similar scenario to murphy1310's, but with an empty directory; the absence of any PDFs was the clue here.)

Upvotes: 0

murphy1310

Reputation: 677

In my case, I was iterating through the files in a directory.
Windows can place a hidden Thumbs.db file in a directory.
PDFBox was being handed that file as if it were a PDF, which produced this error.
Applying a filter to only pick PDF files (*.pdf) helped.
Cheers.
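
A minimal sketch of such a filter, using only the standard library. The class name `PdfFileFilter` and the method `listPdfFiles` are hypothetical names chosen for illustration, and matching on the `.pdf` extension is an assumption about what "PDF files" means here; it does not verify the file content.

    ```java
    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;

    // Hypothetical helper; names are illustrative, not part of PDFBox.
    class PdfFileFilter {

        // Returns the regular files in dir whose name ends with ".pdf"
        // (case-insensitive). Files such as Thumbs.db are skipped.
        static List<Path> listPdfFiles(Path dir) throws IOException {
            List<Path> result = new ArrayList<>();
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
                for (Path entry : entries) {
                    String name = entry.getFileName().toString().toLowerCase(Locale.ROOT);
                    if (Files.isRegularFile(entry) && name.endsWith(".pdf")) {
                        result.add(entry);
                    }
                }
            }
            return result;
        }
    }
    ```

Only the paths returned by `listPdfFiles` would then be passed on to the parser. An empty result is worth logging, since it may point to an outdated or wrong directory.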

Upvotes: 2

asraniinfo

Reputation: 141

You are probably merging a file that is not in PDF format. Please check carefully whether the list contains any file other than a PDF.
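
One way to run that check in code is to look at the first bytes of each file before processing it. The sketch below assumes the file should begin with the marker `%PDF-`, which is how a PDF header normally starts; the class name `PdfHeaderCheck` and the method `startsWithPdfHeader` are hypothetical, and this is a quick sanity check rather than a full validation.

    ```java
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Arrays;

    // Hypothetical helper; names are illustrative, not part of PDFBox.
    class PdfHeaderCheck {

        private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

        // Returns true if the file starts with the bytes "%PDF-".
        static boolean startsWithPdfHeader(Path file) throws IOException {
            try (InputStream in = Files.newInputStream(file)) {
                byte[] head = new byte[MAGIC.length];
                int total = 0;
                // read() may return fewer bytes than requested, so loop.
                while (total < head.length) {
                    int n = in.read(head, total, head.length - total);
                    if (n < 0) {
                        return false; // file is shorter than the marker
                    }
                    total += n;
                }
                return Arrays.equals(head, MAGIC);
            }
        }
    }
    ```

Files for which this returns false can be skipped or reported before they reach the merge step.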

Upvotes: 14

mkl

Reputation: 95918

You first read the leading kilobyte of data into a byte array:

    in.read(data, 0, 1024);

and then you expect PDFBox to get along with the remaining bytes:

    parser = new PDFParser(in);
    parser.parse();

Most likely the actual PDF header is contained in those leading bytes, which you withheld from the PDFBox parser.

Thus, simply let PDFBox read all of the data: do not read from the stream before handing it to the parser.
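
If you still want to inspect the leading bytes yourself, one option is to mark the stream, read, and then reset it, so that the next reader starts at the first byte again. This is a sketch under that assumption; the class name `HeaderPeek` and the method `peek` are hypothetical, and a `ByteArrayInputStream` stands in for the URL stream in the usage line.

    ```java
    import java.io.BufferedInputStream;
    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    // Hypothetical helper; names are illustrative, not part of PDFBox.
    class HeaderPeek {

        // Reads up to limit bytes for inspection, then rewinds the stream
        // so that whoever reads next sees the data from the beginning.
        static String peek(BufferedInputStream in, int limit) throws IOException {
            in.mark(limit);
            byte[] buf = new byte[limit];
            int total = 0;
            // read() may return fewer bytes than requested, so loop.
            while (total < limit) {
                int n = in.read(buf, total, limit - total);
                if (n < 0) {
                    break; // end of stream reached early
                }
                total += n;
            }
            in.reset();
            return new String(buf, 0, total, StandardCharsets.ISO_8859_1);
        }

        public static void main(String[] args) throws IOException {
            byte[] data = "%PDF-1.3\nrest".getBytes(StandardCharsets.ISO_8859_1);
            BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
            System.out.println(peek(in, 8));
        }
    }
    ```

After `peek` returns, the same `in` can be handed to `new PDFParser(in)` as in the question, and the parser would see the stream from its first byte.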

Upvotes: 0
