amitaks2

Reputation: 66

How to read table data from a PDF?

I need to read a PDF file that contains only tabular data, like an Excel sheet, and extract the value of each cell. Is this possible with the iText API? If so, please share how. Other solutions are welcome too.

Upvotes: 2

Views: 1370

Answers (2)

Shaun Poore

Reputation: 642

I recently ran into this problem and wasn't able to make it work with iText.

An alternative I found was to open the PDF document in Adobe and export it to XML. At least with my PDFs, the export preserved the table information, and I was then able to work with the XML programmatically to generate tabular files such as Excel.

The other issue I ran into was that Adobe only lets you export one file at a time, and I had lots of files. Luckily, Adobe also has a merge function. I ended up merging all the files together, exporting them as one big XML file, and working with that file to generate what I needed.
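
To give an idea of the "work with the XML" step, here is a minimal sketch using only the JDK's built-in DOM parser. It assumes the exported XML marks rows as `TR` elements with `TD` or `TH` cells; those element names, the class name `XmlTableToCsv`, and the sample data are assumptions for illustration, so check what your own export actually contains.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XmlTableToCsv {

    // Collects the text of every TD/TH cell, one list per TR row.
    // The TR/TD/TH names are an assumption about the exported XML.
    static List<List<String>> readRows(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<List<String>> rows = new ArrayList<>();
        NodeList trs = doc.getElementsByTagName("TR");
        for (int i = 0; i < trs.getLength(); i++) {
            Element tr = (Element) trs.item(i);
            List<String> cells = new ArrayList<>();
            NodeList kids = tr.getChildNodes();
            for (int j = 0; j < kids.getLength(); j++) {
                Node kid = kids.item(j);
                String name = kid.getNodeName();
                if (kid instanceof Element && ("TD".equals(name) || "TH".equals(name))) {
                    cells.add(kid.getTextContent().trim());
                }
            }
            rows.add(cells);
        }
        return rows;
    }

    // Writes rows as CSV, quoting cells that contain commas, quotes or newlines.
    static String toCsv(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder();
        for (List<String> row : rows) {
            for (int i = 0; i < row.size(); i++) {
                String cell = row.get(i);
                if (cell.contains(",") || cell.contains("\"") || cell.contains("\n")) {
                    cell = "\"" + cell.replace("\"", "\"\"") + "\"";
                }
                if (i > 0) sb.append(',');
                sb.append(cell);
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample standing in for an exported file.
        String xml = "<Table><TR><TH>Name</TH><TH>Qty</TH></TR>"
                   + "<TR><TD>Bolt, M6</TD><TD>12</TD></TR></Table>";
        System.out.print(toCsv(readRows(xml)));
    }
}
```

The resulting CSV opens directly in Excel. For a real export you would read the file from disk instead of a string, and nested tables would need extra handling that this sketch skips.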

Upvotes: 0

Next Door Engineer

Reputation: 2876

The PDF format is just a canvas on which text and graphics are placed without any structure information. As such, there aren't any iText objects in a PDF file. On each page there will probably be a number of strings, but you can't reliably reconstruct a phrase or a paragraph from them. There are probably a number of lines drawn, but you can't retrieve a Table object based on these lines.

In short: recovering the table structure from the content of a PDF file is NOT POSSIBLE with iText.

You can try this! This lets you read PDF pages.
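
If whatever you use to read the page can give you each string together with its position, a common workaround is to guess the rows yourself: strings with nearly the same vertical position are treated as one row and ordered left to right. The sketch below shows only that heuristic. The `Chunk` class, the `groupIntoRows` method, and the tolerance value are hypothetical names made up for illustration, not part of iText, and the result is a guess rather than a real table.

```java
import java.util.ArrayList;
import java.util.List;

public class RowGrouper {

    // Hypothetical holder for one extracted string and its page position.
    static final class Chunk {
        final String text;
        final float x;
        final float y;
        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Groups chunks whose y values lie within yTolerance of the first chunk
    // of the current row, then orders each row from left to right.
    static List<List<String>> groupIntoRows(List<Chunk> chunks, float yTolerance) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // PDF y coordinates grow upward, so larger y means higher on the page.
        sorted.sort((a, b) -> Float.compare(b.y, a.y));

        List<List<Chunk>> rows = new ArrayList<>();
        float rowY = 0f;
        for (Chunk c : sorted) {
            if (rows.isEmpty() || Math.abs(rowY - c.y) > yTolerance) {
                rows.add(new ArrayList<>());
                rowY = c.y;
            }
            rows.get(rows.size() - 1).add(c);
        }

        List<List<String>> result = new ArrayList<>();
        for (List<Chunk> row : rows) {
            row.sort((a, b) -> Float.compare(a.x, b.x));
            List<String> cells = new ArrayList<>();
            for (Chunk c : row) {
                cells.add(c.text);
            }
            result.add(cells);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up positions standing in for extracted text.
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("Qty", 200f, 700.2f));
        chunks.add(new Chunk("Name", 50f, 700f));
        chunks.add(new Chunk("12", 200f, 680f));
        chunks.add(new Chunk("Bolt", 50f, 680.3f));
        System.out.println(groupIntoRows(chunks, 2f));
    }
}
```

This breaks down on multi-line cells, merged cells, and empty cells (a missing string shifts the columns), which is exactly the structure information the PDF does not carry.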

Upvotes: 2
