Reputation: 301
I used PDFBox to parse this PDF document. It throws an exception saying it cannot find the header version info. Any idea why?
I think the version is 1.3; I saw it when I cast every byte to char. The link is http://www.selab.isti.cnr.it/ws-mate/example.pdf
Here is the code of the method and its output:
public String PDFtest(String textLink) throws IOException{
    PDFParser parser;
    String parsedText = null;
    PDFTextStripper pdfStripper;
    PDDocument pdDoc;
    COSDocument cosDoc;
    PDDocumentInformation pdDocInfo;
    StringBuilder sd=new StringBuilder();
    URL link;
    try {
        link = new URL(textLink);
        URLConnection urlConn = link.openConnection();
        BufferedInputStream in = null;
        in = new BufferedInputStream(urlConn.getInputStream());
        byte data[] = new byte[1024];
        in.read(data, 0, 1024);
        parser = new PDFParser(in);
        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdDoc = new PDDocument(cosDoc);
        parsedText = pdfStripper.getText(pdDoc);
    } catch (MalformedURLException ex) {
        Logger.getLogger(HTMLhelper.class.getName()).log(Level.SEVERE, null, ex);
    }
    catch (NumberFormatException e){
        System.out.println("hata");
    }
    return parsedText;
}
Exception:
Exception in thread "main" java.io.IOException: Error: Header doesn't contain versioninfo
at org.apache.pdfbox.pdfparser.PDFParser.parseHeader(PDFParser.java:317)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:173)
at ParsingMachine.HTMLhelper.PDFtest(HTMLhelper.java:99)
at ParsingMachine.tester.main(tester.java:18)
Java Result: 1
Upvotes: 7
Views: 27327
Reputation: 830
The folder being parsed was outdated. It looked empty, so the processing defaulted to Thumbs.db. I remember specifically skipping this file, but apparently that did not happen when the folder was otherwise empty.
Updating the directory fixed it.
(Similar scenario to murphy1310's, but with an empty directory; the absence of any PDFs was the clue here.)
Upvotes: 0
Reputation: 677
In my case, I was iterating through the files in a directory.
Windows can create a hidden Thumbs.db file in a directory.
This was interfering with the PDF processing.
Applying a filter to pick only PDF files (*.pdf) helped.
Cheers.
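A minimal sketch of such a filter, assuming the files are picked up with java.io.File; the class and method names (PdfFileLister, hasPdfExtension, listPdfFiles) are made up for illustration and are not part of PDFBox:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: collects only the entries whose name ends in ".pdf".
final class PdfFileLister {

    // Case-insensitive extension check, so "REPORT.PDF" is accepted
    // and "Thumbs.db" is rejected.
    static boolean hasPdfExtension(String fileName) {
        return fileName.toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    // Returns the regular files in the directory that have a .pdf extension.
    // An empty list means there is nothing to parse.
    static List<File> listPdfFiles(File directory) {
        List<File> result = new ArrayList<>();
        File[] entries = directory.listFiles();
        if (entries == null) {
            return result; // not a directory, or it could not be read
        }
        for (File entry : entries) {
            if (entry.isFile() && hasPdfExtension(entry.getName())) {
                result.add(entry);
            }
        }
        return result;
    }
}
```

Only the files returned by listPdfFiles would then be handed to the parser. An extension check says nothing about the file's content, so a mislabeled file can still trigger the same exception.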
Upvotes: 2
Reputation: 141
You must be merging a file that is not in PDF format. Please check carefully whether the list contains any file other than a PDF.
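One way to check this before merging is to look at the first bytes of each file: a PDF normally begins with the ASCII signature %PDF-. The sketch below uses only the JDK; the names PdfSignature, startsWithPdfMagic and looksLikePdf are hypothetical. It is a strict check, so a file with stray bytes in front of the header would be rejected even if a lenient reader could still open it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: tests whether a file starts with the "%PDF-" signature.
final class PdfSignature {

    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // True if the first 'length' valid bytes of 'head' begin with "%PDF-".
    static boolean startsWithPdfMagic(byte[] head, int length) {
        if (length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads just enough bytes from the file to compare against the signature.
    static boolean looksLikePdf(Path file) throws IOException {
        byte[] head = new byte[MAGIC.length];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop until
            // the buffer is full or the end of the file is reached.
            while (total < head.length) {
                int n = in.read(head, total, head.length - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
        }
        return startsWithPdfMagic(head, total);
    }
}
```

Files for which looksLikePdf returns false can be skipped or reported instead of being passed to the merge.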
Upvotes: 14
Reputation: 95918
You first read the leading kilobyte of data into a byte array:
in.read(data, 0, 1024);
and then you expect PDFBox to get along with only the remaining bytes:
parser = new PDFParser(in);
parser.parse();
Most likely the actual PDF header is contained in those leading bytes, which you withheld from the PDFBox parser.
Thus, simply let PDFBox read all the data, i.e. remove that in.read call before constructing the parser.
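If the leading bytes are still of interest (for example to print the version), they can be inspected without being consumed, because BufferedInputStream supports mark and reset. The following is only a sketch using JDK classes; HeaderPeek and peek are hypothetical names, and the parser call from the question appears only in a comment.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: looks at the start of a stream and then rewinds it.
final class HeaderPeek {

    // Returns up to maxBytes from the current position as text and leaves
    // the stream positioned where it was before the call.
    static String peek(BufferedInputStream in, int maxBytes) throws IOException {
        in.mark(maxBytes);                  // remember the current position
        byte[] buf = new byte[maxBytes];
        int n = in.read(buf, 0, maxBytes);  // may return fewer bytes, or -1 at end of stream
        in.reset();                         // rewind so the next reader sees every byte
        return n <= 0 ? "" : new String(buf, 0, n, StandardCharsets.ISO_8859_1);
    }
}

// Usage in the question's method (sketch):
//   in = new BufferedInputStream(urlConn.getInputStream());
//   String head = HeaderPeek.peek(in, 1024); // starts with "%PDF-1.3" for a version 1.3 file
//   parser = new PDFParser(in);              // the parser now starts at the first byte
```

The reset only works because no more than maxBytes are read between mark and reset; reading further would invalidate the mark.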
Upvotes: 0