Reputation: 45
Thanks in advance.
The Background:
I'm working on a console application that extracts data from specific sections of PDF documents. To do this I first need to convert each PDF into a string I can work with, so I turned to iTextSharp. The PDFs are laid out with two columns per page, so I'm using SimpleTextExtractionStrategy() (I tried iTextSharp.text.pdf.parser.LocationTextExtractionStrategy() but found it ineffective for this page layout).
Description of content being converted to text:
The pages I'm having trouble with have a "header" running up the side of the page. Pages with these side headers are intermittently dispersed through the document.
Image of page layout: http://postimg.org/image/b7i25v0g1/
The Problem:
It seems that when it finishes looking through the columns on a page, it moves on to that side header. It then jumps to the next page with a side header, converts that to text, and then starts again from the top of the page where the first header was encountered.
I'd end up with text that looks like:
Page 1 Content
First Header
Second Header
Page 1 Content
Page 2 Content
etc.
Here is the pdf: http://www.filedropper.com/dd35-completeadventurer
I'm not married to iTextSharp I just need a reliable way to convert documents with this format to text. A work around or alternate method would be appreciated.
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static string ToTxt(string filePath)
{
    string strText = string.Empty;
    try
    {
        PdfReader reader = new PdfReader(filePath);
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            Widgets.ProgressBar(page);
            // Convert the current PDF page to text
            ITextExtractionStrategy its = new SimpleTextExtractionStrategy(); // LocationTextExtractionStrategy proved ineffective for this layout
            string s = PdfTextExtractor.GetTextFromPage(reader, page, its);
            strText += s;
        }
        reader.Close();
        Console.WriteLine("File Extracted");
    }
    catch (Exception e)
    {
        Console.WriteLine("Exception: " + e.Message);
    }
    finally
    {
        Console.Clear();
    }
    return strText;
}
Upvotes: 1
Views: 2227
Reputation: 45
Using mkl's second method (comparing each page against the previous one for a repeat), I came up with the following and it works brilliantly; an easy fix:
public static string ToTxt(string filePath)
{
    string strText = string.Empty;
    try
    {
        PdfReader reader = new PdfReader(filePath);
        string prevPage = "";
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            Widgets.ProgressBar(page);
            // Convert the current PDF page to text
            ITextExtractionStrategy its = new SimpleTextExtractionStrategy();
            string s = PdfTextExtractor.GetTextFromPage(reader, page, its);
            // Skip this page if it duplicates the previous one
            if (prevPage != s)
                strText += s;
            prevPage = s;
        }
        reader.Close();
        Console.WriteLine("File Extracted");
    }
    catch (Exception e)
    {
        Console.WriteLine("Exception: " + e.Message);
    }
    finally
    {
        Console.Clear();
    }
    return strText;
}
Upvotes: 1
Reputation: 95963
As already conjectured in a comment, the duplicate text is already present in the PDF content!
The page contents of pairs of facing pages in your document are often identical: each contains the contents of the whole spread, and the individual pages merely display the left or the right half respectively.
E.g. consider the two pages 6 and 7. Their contents are identical, filling the area of their identical MediaBox. Merely by setting the CropBox (and the ArtBox, BleedBox, and TrimBox) to the left or right half respectively, only the expected content is shown on page 6 and on page 7.
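If you want to verify this diagnosis yourself, a minimal sketch (assuming iTextSharp 5.x; the file name is a placeholder for a local copy of the linked PDF, and GetPageSize here returns the media box) would print the box dimensions of the two pages:

using System;
using iTextSharp.text;
using iTextSharp.text.pdf;

class BoxInspector
{
    static void Main()
    {
        // Placeholder path for a local copy of the document in question
        PdfReader reader = new PdfReader("dd35-completeadventurer.pdf");
        for (int page = 6; page <= 7; page++)
        {
            Rectangle media = reader.GetPageSize(page); // MediaBox
            Rectangle crop = reader.GetCropBox(page);   // CropBox
            Console.WriteLine("Page {0}: MediaBox {1} x {2}, CropBox {3} x {4}",
                page, media.Width, media.Height, crop.Width, crop.Height);
        }
        reader.Close();
    }
}

If the above holds, both pages report the same wide MediaBox but half-width CropBoxes.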
Neither the iText(Sharp) parser framework nor the SimpleTextExtractionStrategy automatically restricts extraction to these boxes; they extract all text drawn anywhere in the content. Hence the duplicate text.
Knowing the cause for the text duplication, there are multiple ways to prevent it:
- You can try to extract the content of only every other PDF page. Unfortunately the scheme described above does not hold for all pages: at least the initial pages (title page, contents, ...) are not created that way, and further into the book there are some artwork pages that do not follow it either. Thus, this option would require quite some management of exceptional pages.
- You can extract the contents of each page but keep the contents of the previously processed page in some variable, and only add the newly extracted content to the result if it does not equal the content of the prior page.
- You can use the iText(Sharp) parser filters. If you restrict the text chunks processed by your strategy to only those drawn inside the crop box of the current page, you prevent duplicate text caused by off-page content; see the sketch after this list. You can find an example filtering by region here: ExtractPageContentArea.java / ExtractPageContentArea.cs.
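For the third option, here is a minimal sketch, assuming iTextSharp 5.x. It follows the linked ExtractPageContentArea sample but swaps that sample's hard-coded rectangle for each page's own crop box; converting the crop box through RectangleJ is my assumption about the rectangle type the filter expects:

using System.Text;
using System.util;                 // RectangleJ
using iTextSharp.text;             // Rectangle
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class CroppedExtractor
{
    // Extracts text page by page, keeping only chunks drawn inside each page's crop box.
    public static string ToTxtCropped(string filePath)
    {
        StringBuilder strText = new StringBuilder();
        PdfReader reader = new PdfReader(filePath);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Off-page (duplicated) content lies outside the crop box
                Rectangle crop = reader.GetCropBox(page);
                RenderFilter filter = new RegionTextRenderFilter(
                    new RectangleJ(crop.Left, crop.Bottom, crop.Width, crop.Height));
                // The filter only passes chunks inside the region to the wrapped strategy
                ITextExtractionStrategy strategy =
                    new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
                strText.Append(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
        }
        finally
        {
            reader.Close();
        }
        return strText.ToString();
    }
}

Since FilteredTextRenderListener merely delegates to whatever strategy it wraps, you can hand it a SimpleTextExtractionStrategy instead if the location-based strategy again mishandles the two-column order.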
Upvotes: 1