Ros

Reputation: 411

Parsing Complex PDF document with C#

See the attached K-1 document. I have tried numerous tweaks with the iTextSharp library but haven't succeeded in loading the data correctly.

Ideally I would like to parse the document the way a human would read it: one text box at a time, reading its contents.

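    // Needs: using System; using System.Text;
    //        using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
    // FILE and password are defined elsewhere.
    var reader = new PdfReader(FILE, Encoding.ASCII.GetBytes(password));
    var strategy = new LocationTextExtractionStrategy();
    // Extract all text of page 1 as one string, then split it into lines.
    string currentPageText = PdfTextExtractor.GetTextFromPage(reader, 1, strategy);
    string[] lines = currentPageText.Split(new string[] {"\r\n", "\n"}, StringSplitOptions.None);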

I also tried parsing the annotations, but had no luck.

I'm a newbie and am probably looking in the wrong place. Can you help guide me in the right direction?

Thanks a lot.

Upvotes: 4

Views: 8331

Answers (3)

Vadim

Reputation: 414

Take a look at the IvyPdf library and its template editor. It is written in C# and provides high-level functions to parse and extract data, so you don't have to deal with the internals of PDF documents. You can build fairly complex scenarios with it.

I don't think it can read annotations though.

Upvotes: 2

Eugene

Reputation: 2878

The first question is whether this form is electronic or scanned. A scanned form would make the data extraction much harder, as it would also involve OCR.

If you have an electronic PDF and all the forms are similar, why not use the following strategy:

  • store the coordinates of each "box" in a config file
  • process the documents and extract the text from every "box" (i.e. region)
  • post-process the extracted text with regular expressions to separate the name from the address (or simply define the regions so that the text is read line by line)

If you have a few variations of the form, you can check the very first box to extract the name of the form and load the appropriate settings file (the one that contains the set of regions for that variation).

This approach should work with any PDF library.
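As a minimal sketch of the region-based approach, assuming iTextSharp 5.x since that is what the question uses: the helper name `ExtractRegions` and the idea of passing the regions as a dictionary are illustrative assumptions, and the rectangles would come from your own config file.

```csharp
using System.Collections.Generic;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class RegionExtractor
{
    // Hypothetical helper: returns the text found inside each named region of one page.
    // Each Rectangle is (lower-left x, lower-left y, upper-right x, upper-right y)
    // in PDF user space units.
    public static Dictionary<string, string> ExtractRegions(
        PdfReader reader, int pageNumber, IDictionary<string, Rectangle> regions)
    {
        var result = new Dictionary<string, string>();
        foreach (KeyValuePair<string, Rectangle> entry in regions)
        {
            // Only text that falls into the region is passed on to the strategy.
            var filter = new RegionTextRenderFilter(entry.Value);
            var strategy = new FilteredTextRenderListener(
                new LocationTextExtractionStrategy(), filter);
            result[entry.Key] = PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy);
        }
        return result;
    }
}
```

A call might look like `RegionExtractor.ExtractRegions(reader, 1, regions)`, where `regions` maps a box name to a `Rectangle` whose coordinates you measured once for that form variation. The page is parsed once per region here, which is simple but not the fastest option.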

Upvotes: 3

mkl

Reputation: 95898

You would like to parse out the document similar to how humans would read them, one textbox at a time, reading its contents. That means you first will have to try and automatically recognize those text boxes. Then you can extract text by these areas.

To recognize those text boxes automatically in your document, you have to extract the border lines enclosing the boxes. For this you will first have to find out how those border lines are created. They might be drawn using vector graphics as lines or rectangles, but they could also be part of a background bitmap image.

Unfortunately I don't have your IRS form at hand and so cannot analyze its internals. Let's assume the borders are created using vector graphics for now. Thus, you have to extract vector graphics.

To extract vector graphics with iText(Sharp), you make use of classes from the iText(Sharp) parser namespace by making them parse the document and feed the parsing events into a listener you create which collects the vector graphic operations:

  • You implement IExtRenderListener, in particular its ModifyPath and RenderPath methods, which are called respectively when additional path elements (e.g. lines or rectangles) are added to the current path and when the current path is rendered (i.e. stroked and/or filled). Your implementation collects this information.
  • You parse your document into an instance of your listener, e.g. using PdfReaderContentParser.
  • You analyse the lines and rectangles found and derive the coordinates of the boxes they build.
  • You parse the same page into a LocationTextExtractionStrategy instance.
  • You retrieve the texts of the recognized text boxes by calling LocationTextExtractionStrategy.GetResultantText with a matching ITextChunkFilter argument for each box.

(Actually you can do the parsing into the instance of your listener and the LocationTextExtractionStrategy instance in one pass for a bit of optimization.)
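The listener described in the first two steps could look roughly like the following sketch. It assumes an iTextSharp 5.5.x version that ships IExtRenderListener; the class name `PathCollector` and the `float[]` segment representation are illustrative assumptions, and the member names are given as I remember them, so check them against your version. Curves are ignored and the current transformation matrix is not applied.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical listener that collects straight path segments as {x1, y1, x2, y2}.
public class PathCollector : IExtRenderListener
{
    public readonly List<float[]> Segments = new List<float[]>();
    private readonly List<float[]> pending = new List<float[]>();
    private float startX, startY, curX, curY;

    public void ModifyPath(PathConstructionRenderInfo info)
    {
        int op = info.Operation;
        if (op == PathConstructionRenderInfo.MOVETO)
        {
            IList<float> d = info.SegmentData;
            startX = curX = d[0]; startY = curY = d[1];
        }
        else if (op == PathConstructionRenderInfo.LINETO)
        {
            IList<float> d = info.SegmentData;
            pending.Add(new[] { curX, curY, d[0], d[1] });
            curX = d[0]; curY = d[1];
        }
        else if (op == PathConstructionRenderInfo.CLOSE)
        {
            pending.Add(new[] { curX, curY, startX, startY });
            curX = startX; curY = startY;
        }
        else if (op == PathConstructionRenderInfo.RECT)
        {
            // A rectangle is given as x, y, width, height; store its four edges.
            IList<float> d = info.SegmentData;
            float x = d[0], y = d[1], w = d[2], h = d[3];
            pending.Add(new[] { x, y, x + w, y });
            pending.Add(new[] { x + w, y, x + w, y + h });
            pending.Add(new[] { x + w, y + h, x, y + h });
            pending.Add(new[] { x, y + h, x, y });
            startX = curX = x; startY = curY = y;
        }
        // Curve operations are skipped in this sketch.
    }

    public Path RenderPath(PathPaintingRenderInfo info)
    {
        // Keep only paths that are actually painted (stroked and/or filled).
        if (info.Operation != PathPaintingRenderInfo.NO_OP)
            Segments.AddRange(pending);
        pending.Clear();
        return null;
    }

    public void ClipPath(int rule) { }
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderText(TextRenderInfo renderInfo) { }
    public void RenderImage(ImageRenderInfo renderInfo) { }

    // Parses one page into a new collector.
    public static PathCollector Collect(PdfReader reader, int pageNumber)
    {
        var collector = new PathCollector();
        new PdfReaderContentParser(reader).ProcessContent(pageNumber, collector);
        return collector;
    }
}
```

Note that some documents draw borders as thin filled rectangles rather than stroked lines, so the collected segments may need some normalization before you look for boxes.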

All iText(Sharp) specific tasks are trivial, and the only other task, the analysis of the lines and rectangles found to derive the coordinates of the boxes, should be no big problem for a software developer proficient in C#.
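For the analysis step, one naive way to derive boxes from the collected lines is sketched below. It uses no iText types; `Segment`, `Box`, `BoxFinder` and the tolerance value are all illustrative assumptions. It reports every rectangle whose four sides are each covered by a single line, so a box made of two smaller cells is reported along with the cells, and a side drawn as several short pieces is not detected.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public struct Segment
{
    public float X1, Y1, X2, Y2;
    public Segment(float x1, float y1, float x2, float y2) { X1 = x1; Y1 = y1; X2 = x2; Y2 = y2; }
}

public struct Box
{
    public float Left, Bottom, Right, Top;
    public Box(float left, float bottom, float right, float top)
    { Left = left; Bottom = bottom; Right = right; Top = top; }
}

public static class BoxFinder
{
    // Assumed tolerance (in page units) for treating coordinates as equal.
    const float Tol = 1f;

    public static List<Box> FindBoxes(IEnumerable<Segment> segments)
    {
        List<Segment> all = segments.ToList();
        List<Segment> horizontals = all.Where(s => Math.Abs(s.Y1 - s.Y2) <= Tol).ToList();
        List<Segment> verticals = all.Where(s => Math.Abs(s.X1 - s.X2) <= Tol).ToList();
        var boxes = new List<Box>();

        // Brute force: try every (bottom, top, left, right) combination of lines.
        foreach (Segment bottom in horizontals)
        foreach (Segment top in horizontals)
        {
            if (top.Y1 - bottom.Y1 <= Tol) continue;   // top must lie above bottom
            foreach (Segment left in verticals)
            foreach (Segment right in verticals)
            {
                if (right.X1 - left.X1 <= Tol) continue;   // right must lie right of left
                if (CoversX(bottom, left.X1, right.X1) && CoversX(top, left.X1, right.X1) &&
                    CoversY(left, bottom.Y1, top.Y1) && CoversY(right, bottom.Y1, top.Y1))
                {
                    boxes.Add(new Box(left.X1, bottom.Y1, right.X1, top.Y1));
                }
            }
        }
        return boxes;
    }

    // True if the horizontal segment spans at least [from, to] on the x axis.
    static bool CoversX(Segment h, float from, float to)
    {
        return Math.Min(h.X1, h.X2) <= from + Tol && Math.Max(h.X1, h.X2) >= to - Tol;
    }

    // True if the vertical segment spans at least [from, to] on the y axis.
    static bool CoversY(Segment v, float from, float to)
    {
        return Math.Min(v.Y1, v.Y2) <= from + Tol && Math.Max(v.Y1, v.Y2) >= to - Tol;
    }
}
```

The brute-force search is fine for the few hundred lines of a typical form page. If the lines come from stroked paths with a non-trivial transformation matrix, transform the coordinates before feeding them in.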

Upvotes: 3