Reputation: 33
I am trying to parse a table in a PDF file and output it as CSV. I have attached sample data from the PDF below (only a few columns) and the expected output for it. Each column has a fixed width, for example Company Name (18 chars), Amount (8 chars), Type (5 chars), and so on. I tried using the iText and PDFBox jars to extract the text of each page and parse it line by line, but that does not seem to be a good solution because the line breaks and page breaks in the extracted PDF text are not reliable. Please let me know if there is a more appropriate solution. We want to use open-source software for this.
Upvotes: 1
Views: 1563
Reputation: 9057
This is a very complex problem. There are even multiple master's dissertations on the subject.
An easy analogy: I have 5000 puzzle pieces, all of them perfectly square, so any piece could fit anywhere. Some of them have bits of lines on them, some of them have snippets of text.
However, that does not mean it can't be done. It'll just take work.
General approach:
This high-level approach should make it painfully obvious why this is not a widely available thing. It's very hard to implement. It requires domain knowledge of PDF, fonts, and machine learning.
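That said, for the narrow case in the question, where every column has a fixed character width, the CSV step itself is simple once you have clean text lines. Below is a minimal sketch of just that step. It assumes the text has already been extracted (for example with PDFBox) and that the extractor preserved the column padding, which is exactly the part that often fails in practice. The class name, method names, and the sample row are made up for illustration; only the widths 18, 8, and 5 come from the question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: slices fixed-width text lines into CSV rows.
final class FixedWidthToCsv {

    // Cut one line into fields of the given character widths and trim the padding.
    // Lines shorter than the total width yield empty trailing fields.
    static List<String> splitFixedWidth(String line, int[] widths) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int width : widths) {
            int from = Math.min(start, line.length());
            int to = Math.min(start + width, line.length());
            fields.add(line.substring(from, to).trim());
            start += width;
        }
        return fields;
    }

    // Quote a field if it contains a comma, a double quote, or a newline.
    static String escapeCsv(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Convert one fixed-width line into one CSV line.
    static String toCsvLine(String line, int[] widths) {
        List<String> escaped = new ArrayList<>();
        for (String field : splitFixedWidth(line, widths)) {
            escaped.add(escapeCsv(field));
        }
        return String.join(",", escaped);
    }

    public static void main(String[] args) {
        // Widths from the question: Company Name (18), Amount (8), Type (5).
        int[] widths = {18, 8, 5};
        // Made-up sample row, padded to the column widths.
        String line = String.format("%-18s%-8s%-5s", "Acme, Inc.", "1234.50", "CASH");
        System.out.println(toCsvLine(line, widths)); // prints "Acme, Inc.",1234.50,CASH
    }
}
```

This only works when the extracted lines really are padded to the column widths. If a cell wraps onto a second line or a row is split by a page break, the slices will be wrong, and you are back to the general problem described above.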
If you are OK with commercial solutions, try pdf2Data. It's an iText add-on that provides this exact functionality.
http://itextpdf.com/itext7/pdf2Data
Upvotes: 3