itsvks

Reputation: 403

How to parse a PDF that contains tabular data using PDFBox

Can anybody help me extract table data using iText or PDFBox? I have a PDF with 1000 pages, and my job is to parse it and store the data in a database.

Upvotes: 5

Views: 4021

Answers (2)

user2879704

Reputation:

You can use this code to extract the text of the document as a single string:

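// pathToFile is assumed to be a java.io.File (PDFBox 1.x also accepts a String path)
PDDocument document = PDDocument.load(pathToFile);
PDFTextStripper s = new PDFTextStripper();
String content = s.getText(document);
document.close(); // release the file once the text has been extracted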

Then you can use Java regular expressions to parse the text row by row and load the values into your Java POJO beans.
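The row-by-row regex step might look roughly like the sketch below. It assumes a hypothetical three-column table (id, name, amount) whose cells come out of the text stripper separated by whitespace. The RowParser and Row names and the pattern are illustrative only and are not part of PDFBox; the pattern has to be adapted to the real layout of your rows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RowParser {
    // Hypothetical POJO for one table row; the three fields are assumptions.
    static class Row {
        final int id;
        final String name;
        final double amount;

        Row(int id, String name, double amount) {
            this.id = id;
            this.name = name;
            this.amount = amount;
        }
    }

    // Assumed row layout: integer id, free-text name, amount with two decimals.
    private static final Pattern ROW =
            Pattern.compile("^(\\d+)\\s+(.+?)\\s+(\\d+\\.\\d{2})$");

    // Splits the extracted text into lines and keeps only the lines that match ROW,
    // so headers, footers and page numbers are skipped.
    static List<Row> parse(String content) {
        List<Row> rows = new ArrayList<>();
        for (String line : content.split("\\r?\\n")) {
            Matcher m = ROW.matcher(line.trim());
            if (m.matches()) {
                rows.add(new Row(Integer.parseInt(m.group(1)),
                                 m.group(2),
                                 Double.parseDouble(m.group(3))));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Stand-in for the String returned by PDFTextStripper.getText(document).
        String content = "Id Name Amount\n1 Alice Smith 10.50\n2 Bob 7.25\nPage 1 of 1000\n";
        for (Row r : parse(content)) {
            System.out.println(r.id + " | " + r.name + " | " + r.amount);
        }
    }
}
```

Lines that do not match the pattern are silently dropped here. For a 1000-page file it may be worth logging them instead, so that rows with an unexpected layout are not lost before the data reaches the database.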

Upvotes: 1

mark stephens

Reputation: 3184

A PDF does not contain any table structure elements unless it contains additional XML that defines the table. Otherwise there is no structure. I wrote a blog article on how to find out whether a file has this information.

Some tools, such as PDFBox, will try to guess the table layout, but the results can be hit and miss.

Upvotes: 4
