Reputation: 611
I am trying to read a PDF using iTextSharp in a .NET application. I am able to read individual words successfully. The challenge I am facing now is reading a table. I have a table structure like this:
Please note that some of the column names span two lines, for example 'Department Code' and 'Employee Identification Number'.
My requirement is to read the Employee Identification Number and salary if the employee belongs to the 'HR' department. For this I have to check whether a column named 'Department Code' exists in the PDF file.
When I read this table using iTextSharp, the 'Department' part of the 'Department Code' column comes at position 1, but 'Code' comes at position 5. This is because the column name is displayed on two lines, and four other words in the PDF are read before I reach the 'Code' part of this column. I am totally stuck at this :(
Does anybody have any idea how to make sure that a column named 'Department Code' exists, and how to read the corresponding values from this table?
Appreciate your help!
Regards, Jaleel
Upvotes: 0
Views: 2027
Reputation: 55417
Unfortunately PDFs don't actually have a concept of "tables". What looks like a table is just a bunch of arbitrary text that happens to have lines around it. Most PDF creation libraries let you create content from a "table", but ultimately they turn it into text and unrelated lines. Also, what you see as a "blank cell" is probably actually no text at all (although it could be a space).
For this kind of thing you're pretty much just going to have to come up with some arbitrary rules specific to your document. You could try to calculate where lines exist relative to text and try to rebuild your table in a more logical format but you're going to be hard-pressed to do that.
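To give a rough idea of what such document-specific rules might look like, here is a minimal sketch of the position-based rebuild. It is written in Python only to keep it short, and the logic should port to C# without much change. It assumes you can get each text chunk's left x and baseline y out of iTextSharp (for example through a custom text extraction strategy), that columns are left-aligned, and that the header occupies the top two text lines. The `rebuild_table` function, its parameters, and the sample coordinates and values are all made up for illustration.

```python
def cluster(values, tolerance):
    """Collapse nearby numbers into groups; return one value per group, ascending."""
    centers = []
    for v in sorted(values):
        if not centers or v - centers[-1] > tolerance:
            centers.append(v)
    return centers


def nearest(centers, v):
    """Index of the group whose representative value is closest to v."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - v))


def rebuild_table(chunks, header_lines=2, tolerance=3.0):
    """chunks: iterable of (text, x_left, y_baseline) tuples.

    Returns (headers, rows), where rows is a list of dicts keyed by header.
    Assumption: the first `header_lines` text lines form the header.
    """
    xs = cluster([x for _, x, _ in chunks], tolerance)
    ys = cluster([y for _, _, y in chunks], tolerance)
    ys.reverse()  # PDF y grows upward, so the largest y is the top line

    # Place every chunk in a (line, column) cell based on its position.
    grid = {}
    for text, x, y in chunks:
        grid[(nearest(ys, y), nearest(xs, x))] = text

    # Join the header fragments of each column, top line first.
    headers = [
        " ".join(grid[(r, c)] for r in range(header_lines) if (r, c) in grid)
        for c in range(len(xs))
    ]
    rows = [
        {headers[c]: grid.get((r, c), "") for c in range(len(xs))}
        for r in range(header_lines, len(ys))
    ]
    return headers, rows


# Invented sample data: three columns, a two-line header, three data rows.
chunks = [
    ("Department", 50, 700), ("Employee Identification", 150, 700), ("Salary", 300, 700),
    ("Code", 50, 688), ("Number", 150, 688),
    ("HR", 50, 670), ("1001", 150, 670), ("52000", 300, 670),
    ("IT", 50, 655), ("1002", 150, 655), ("61000", 300, 655),
    ("HR", 50, 640), ("1003", 150, 640), ("48000", 300, 640),
]

headers, rows = rebuild_table(chunks)
if "Department Code" in headers:
    hr = [
        (row["Employee Identification Number"], row["Salary"])
        for row in rows
        if row["Department Code"] == "HR"
    ]
    print(hr)
```

Because chunks are matched by position rather than by reading order, 'Department' and 'Code' end up in the same column even when other text is read between them. The tolerance and the number of header lines are exactly the kind of arbitrary, per-document values you would have to tune by hand.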
Upvotes: 1