Reputation: 33
I am working on a project in which I have to search for text in some PDF files. The pages of those PDF files have footers, and the footer text uses a different font size than the main content. I'm using iTextSharp's PdfReader class, and I don't want the search to match text in the footers. I think the solution must be either to search by font size or to ignore the footers. Any ideas?
Here is my code:
private List<int> ReadPdfFile(string fileName, String searchText, int index)
{
    List<int> pages = new List<int>();
    if (File.Exists(fileName))
    {
        for (int page = 1; page <= pdfReaders[index].NumberOfPages; page++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentPageText = PdfTextExtractor.GetTextFromPage(pdfReaders[index], page, strategy);
            if (currentPageText.Contains(searchText))
            {
                pages.Add(page);
            }
        }
    }
    return pages;
}
Upvotes: 0
Views: 469
Reputation: 95953
If one only wants to extract a certain part of the text of a page, e.g.
- only text located in a given part of the page area, for example the left half of the page (in case of two columns), between given y values (to exclude headers and footers), or outside the crop box (to detect text hidden there), or
- only text in a given style, for example only red text, only text of a given size range, ...
one can filter the information the text extraction strategy receives as input by using a FilteredTextRenderListener with matching RenderFilter instances:
RenderFilter filter = ...;
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
ITextExtractionStrategy filtered = new FilteredTextRenderListener(strategy, filter);
string filteredCurrentPageText = PdfTextExtractor.GetTextFromPage(pdfReaders[index], page, filtered);
Your filter class merely has to extend the abstract class RenderFilter and override the Allow* methods as desired:
public abstract class RenderFilter
{
    public virtual bool AllowText(TextRenderInfo renderInfo)
    {
        return true;
    }
    public virtual bool AllowImage(ImageRenderInfo renderInfo)
    {
        return true;
    }
}
TextRenderInfo makes many properties of the incoming text chunks available to filter by.
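For the footer case in the question, two such filters could look roughly like the sketch below. The class names AboveFooterFilter and MinTextHeightFilter are hypothetical, and the threshold values are assumptions that have to be measured on the actual documents. The first filter assumes unrotated pages with the footer at the bottom. The second one only approximates the font size by the distance between the ascent line and the descent line of a chunk, since that is not the nominal font size itself.

```csharp
using iTextSharp.text.pdf.parser;

// Hypothetical filter: keeps only text whose baseline starts at or above minY
// (in default user space units, measured from the bottom of the page).
public class AboveFooterFilter : RenderFilter
{
    private readonly float minY;

    public AboveFooterFilter(float minY)
    {
        this.minY = minY;
    }

    public override bool AllowText(TextRenderInfo renderInfo)
    {
        // y coordinate of the start of the chunk's baseline
        float y = renderInfo.GetBaseline().GetStartPoint()[Vector.I2];
        return y >= minY;
    }
}

// Hypothetical filter: keeps only text whose rendered height is at least minHeight.
// The height (ascent line minus descent line) is only a rough stand-in for the font size.
public class MinTextHeightFilter : RenderFilter
{
    private readonly float minHeight;

    public MinTextHeightFilter(float minHeight)
    {
        this.minHeight = minHeight;
    }

    public override bool AllowText(TextRenderInfo renderInfo)
    {
        float top = renderInfo.GetAscentLine().GetStartPoint()[Vector.I2];
        float bottom = renderInfo.GetDescentLine().GetStartPoint()[Vector.I2];
        return (top - bottom) >= minHeight;
    }
}
```

Either one can then be passed as the filter in the extraction code above, e.g. new FilteredTextRenderListener(strategy, new AboveFooterFilter(60f)), where 60 is merely an assumed footer height in points.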
Upvotes: 1