Reputation: 45
Thanks in advance.
The Background:
I'm working on a console application that extracts data from specific sections of PDF documents. To do this I first need to convert each PDF into a string I can work with, so I turned to iTextSharp. The PDFs are laid out with two columns per page, so I'm using SimpleTextExtractionStrategy() (I tried iTextSharp.text.pdf.parser.LocationTextExtractionStrategy() but found it ineffective for this page layout).
Description of content being converted to text:
The pages I'm having trouble with have a "header" running up the side of the page. Pages with these side headers are intermittently dispersed through the document.
Image of page layout: http://postimg.org/image/b7i25v0g1/
The Problem:
It seems that when it finishes looking through the columns on a page, it moves on to that side header. It then jumps to the next page with a side header, converts that to text, and then starts again from the top of the page where the first header was encountered.
I'd end up with text that looks like:
Page 1 Content
First Header
Second Header
Page 1 Content
Page 2 Content
etc.
Here is the pdf: http://www.filedropper.com/dd35-completeadventurer
I'm not married to iTextSharp I just need a reliable way to convert documents with this format to text. A work around or alternate method would be appreciated.
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static string ToTxt(string filePath)
{
    string strText = string.Empty;
    try
    {
        PdfReader reader = new PdfReader(filePath);
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            Widgets.ProgressBar(page);
            // Convert the current PDF page to text
            ITextExtractionStrategy its = new SimpleTextExtractionStrategy(); // LocationTextExtractionStrategy proved ineffective for this layout
            string s = PdfTextExtractor.GetTextFromPage(reader, page, its);
            strText += s;
        }
        reader.Close();
        Console.WriteLine("File Extracted");
    }
    catch (Exception e)
    {
        Console.WriteLine("Exception: " + e.Message);
    }
    finally
    {
        Console.Clear();
    }
    return strText;
}
Upvotes: 1
Views: 2227
Reputation: 45
Using mkl's second method (comparing each page against the previous one for a repeat), I came up with the following and it works brilliantly; an easy fix:
public static string ToTxt(string filePath)
{
    string strText = string.Empty;
    try
    {
        PdfReader reader = new PdfReader(filePath);
        string prevPage = "";
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            Widgets.ProgressBar(page);
            // Convert the current PDF page to text
            ITextExtractionStrategy its = new SimpleTextExtractionStrategy();
            string s = PdfTextExtractor.GetTextFromPage(reader, page, its);
            // Skip this page if it duplicates the previous one
            if (prevPage != s)
                strText += s;
            prevPage = s;
        }
        reader.Close();
        Console.WriteLine("File Extracted");
    }
    catch (Exception e)
    {
        Console.WriteLine("Exception: " + e.Message);
    }
    finally
    {
        Console.Clear();
    }
    return strText;
}
Upvotes: 1
Reputation: 95963
As already conjectured in a comment, the duplicate text is already present in the PDF content!
The page contents of pairs of facing pages in your document are often identical: each contains the contents of the whole spread, and the individual pages merely display the left or the right half respectively.
E.g. consider the two pages 6 and 7. Their contents are identical, filling the area of their identical MediaBox. Merely by setting the CropBox (and the ArtBox, BleedBox, and TrimBox) to the left or right half respectively, only the expected content is shown on page 6 and on page 7.
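If you want to verify this diagnosis yourself, a minimal sketch (assuming iTextSharp 5.x; the file name is a placeholder for a local copy of the linked PDF, and GetPageSize here returns the media box) would print the box dimensions of the two pages:

using System;
using iTextSharp.text;
using iTextSharp.text.pdf;

class BoxInspector
{
    static void Main()
    {
        // Placeholder path for a local copy of the document in question
        PdfReader reader = new PdfReader("dd35-completeadventurer.pdf");
        for (int page = 6; page <= 7; page++)
        {
            Rectangle media = reader.GetPageSize(page); // MediaBox
            Rectangle crop = reader.GetCropBox(page);   // CropBox
            Console.WriteLine("Page {0}: MediaBox {1} x {2}, CropBox {3} x {4}",
                page, media.Width, media.Height, crop.Width, crop.Height);
        }
        reader.Close();
    }
}

If the above holds, both pages report the same wide MediaBox but half-width CropBoxes.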
Neither the iText(Sharp) parser framework nor the SimpleTextExtractionStrategy automatically restricts extraction to these boxes; they extract all text drawn anywhere in the content. Hence the duplicate text.
Knowing the cause for the text duplication, there are multiple ways to prevent it:
- You can try to extract the content of only every other PDF page. Unfortunately the scheme described above does not hold for all pages: at least the initial pages (title page, contents, ...) are not created that way, and further into the book there are some artwork pages that do not follow it either. Thus, this option would require quite some management of exceptional pages.
- You can extract the contents of each page but keep the contents of the previously processed page in some variable, and only add the newly extracted content to the result if it does not equal the content of the prior page.
- You can use the iText(Sharp) parser filters. If you restrict the text chunks processed by your strategy to only those drawn inside the crop box of the current page, you prevent duplicate text caused by off-page content; see the sketch after this list. You can find an example filtering by region here: ExtractPageContentArea.java / ExtractPageContentArea.cs.
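For the third option, here is a minimal sketch, assuming iTextSharp 5.x. It follows the linked ExtractPageContentArea sample but swaps that sample's hard-coded rectangle for each page's own crop box; converting the crop box through RectangleJ is my assumption about the rectangle type the filter expects:

using System.Text;
using System.util;                 // RectangleJ
using iTextSharp.text;             // Rectangle
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class CroppedExtractor
{
    // Extracts text page by page, keeping only chunks drawn inside each page's crop box.
    public static string ToTxtCropped(string filePath)
    {
        StringBuilder strText = new StringBuilder();
        PdfReader reader = new PdfReader(filePath);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Off-page (duplicated) content lies outside the crop box
                Rectangle crop = reader.GetCropBox(page);
                RenderFilter filter = new RegionTextRenderFilter(
                    new RectangleJ(crop.Left, crop.Bottom, crop.Width, crop.Height));
                // The filter only passes chunks inside the region to the wrapped strategy
                ITextExtractionStrategy strategy =
                    new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
                strText.Append(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
        }
        finally
        {
            reader.Close();
        }
        return strText.ToString();
    }
}

Since FilteredTextRenderListener merely delegates to whatever strategy it wraps, you can hand it a SimpleTextExtractionStrategy instead if the location-based strategy again mishandles the two-column order.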
Upvotes: 1