Reputation: 403
Can anybody help me extract table data from a PDF using iText or PDFBox? I have a PDF with 1000 pages, and my job is to parse it and store the data in a database.
Upvotes: 5
Views: 4021
Reputation:
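You can use this code to extract the text as a single string:
PDDocument document = PDDocument.load(pathToFile); // pathToFile: a java.io.File in PDFBox 2.x (1.x also accepts a String path)
PDFTextStripper s = new PDFTextStripper();
String content = s.getText(document);
document.close(); // release the file once the text has been read
Then you can use Java regular expressions to parse the text row by row and load the values into your Java POJO beans.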
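To make the row-by-row regex step concrete, here is a minimal sketch. It assumes each table row comes out of the text stripper as one line of the form "name quantity price"; that layout, the RowParser class and the Item bean are all made up for illustration, so the pattern will need adjusting to whatever your extracted text really looks like.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RowParser {

    // Hypothetical bean; the real fields depend on the columns of your table.
    public static final class Item {
        public final String name;
        public final int quantity;
        public final double price;

        Item(String name, int quantity, double price) {
            this.name = name;
            this.quantity = quantity;
            this.price = price;
        }
    }

    // Assumed row layout: <name> <integer quantity> <price with two decimals>.
    private static final Pattern ROW =
            Pattern.compile("^(.+?)\\s+(\\d+)\\s+(\\d+\\.\\d{2})$");

    // Splits the extracted text into lines and keeps only the lines that match ROW.
    public static List<Item> parse(String content) {
        List<Item> items = new ArrayList<>();
        for (String line : content.split("\\R")) {
            Matcher m = ROW.matcher(line.trim());
            if (m.matches()) {
                items.add(new Item(
                        m.group(1),
                        Integer.parseInt(m.group(2)),
                        Double.parseDouble(m.group(3))));
            }
        }
        return items;
    }
}
```

Lines that do not match the pattern (headers, page footers and so on) are skipped silently, so it is worth logging them at least once to check that no real rows are being dropped.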
Upvotes: 1
Reputation: 3184
PDFs do not contain any table structure elements unless the file contains additional XML that defines the table. Otherwise there is no structure. There is a blog article I wrote on how to find out.
Some tools like PDFBox will make an effort to guess the table, but it can be hit and miss.
Upvotes: 4