brsblge

Reputation: 33

How to search a PDF file by font size, or exclude the footer while searching?

I am working on a project in which I have to search for text in PDF files. The pages of those files have footers, and the footer text uses a different font size than the main content. I'm using iTextSharp's PdfReader class, and I don't want the search to match text in the footers. I think the solution must be either to search by font size or to ignore the footers. Any ideas?

Here is my code:

private List<int> ReadPdfFile(string fileName, String searchText, int index)
{
    List<int> pages = new List<int>();
    if (File.Exists(fileName))
    {
        for (int page = 1; page <= pdfReaders[index].NumberOfPages; page++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentPageText = PdfTextExtractor.GetTextFromPage(pdfReaders[index], page, strategy);
            if (currentPageText.Contains(searchText))
            {
                pages.Add(page);
            }
        }
    }
    return pages;
}

Upvotes: 0

Views: 469

Answers (1)

mkl

Reputation: 95953

If one only wants to extract a certain part of the text of a page, e.g.

  • only text located in a given part of the page area, for example the left half page (in case of two columns), between given y values (to exclude headers and footers), or outside the crop box (to detect text hidden there), or

  • only text in a given style, for example only red text, only text of a given size range, ...

one can filter the information the text extraction strategy receives as input by using a FilteredTextRenderListener with matching RenderFilter instances:

RenderFilter filter = ...;
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
ITextExtractionStrategy filtered = new FilteredTextRenderListener(strategy, filter);
string filteredCurrentPageText = PdfTextExtractor.GetTextFromPage(pdfReaders[index], page, filtered);
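
As a rough illustration of the position-based case, the following sketch uses RegionTextRenderFilter, a RenderFilter that ships with iTextSharp 5.x, to keep only text whose baseline intersects the page area above the footer. The helper name GetTextAboveFooter and the footerHeight parameter are hypothetical, and the sketch assumes an iTextSharp 5.x version in which RegionTextRenderFilter accepts an iTextSharp.text.Rectangle, as well as unrotated pages whose footer occupies a fixed strip at the bottom:

```csharp
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class FooterFreeExtraction
{
    // Hypothetical helper: extracts the text of one page while ignoring
    // chunks whose baseline lies entirely inside the bottom strip of
    // height footerHeight (in PDF user units, usually points).
    public static string GetTextAboveFooter(PdfReader reader, int page, float footerHeight)
    {
        // Assumption: the media box is the visible page and is not rotated.
        Rectangle pageSize = reader.GetPageSize(page);

        // Region covering everything above the assumed footer strip.
        Rectangle body = new Rectangle(
            pageSize.Left, pageSize.Bottom + footerHeight,
            pageSize.Right, pageSize.Top);

        RenderFilter filter = new RegionTextRenderFilter(body);
        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        ITextExtractionStrategy filtered = new FilteredTextRenderListener(strategy, filter);
        return PdfTextExtractor.GetTextFromPage(reader, page, filtered);
    }
}
```

In the loop from the question, the call to PdfTextExtractor.GetTextFromPage could then be replaced by something like GetTextAboveFooter(pdfReaders[index], page, 50f), where 50 is only a placeholder for the actual footer height of your documents.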

Your filter class merely has to extend the abstract class RenderFilter and override its Allow* methods as needed:

public abstract class RenderFilter
{
    public virtual bool AllowText(TextRenderInfo renderInfo)
    {
        return true;
    }

    public virtual bool AllowImage(ImageRenderInfo renderInfo)
    {
        return true;
    }
}

TextRenderInfo exposes many properties of the incoming text chunks that you can filter by.
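
A custom filter along these lines might look like the following minimal sketch. The class name BodyTextFilter and its two thresholds are hypothetical. Because TextRenderInfo in iTextSharp 5.x does not, as far as I know, expose the font size directly, the sketch approximates it by the distance between the ascent line and the descent line of each chunk, which is proportional to the font size but depends on the font's metrics. It also assumes horizontal, unrotated text:

```csharp
using iTextSharp.text.pdf.parser;

// Hypothetical filter: lets a text chunk through only if its baseline is
// at or above footerTopY and its approximate glyph height is at least
// minHeight. Both values are in PDF user units and must be tuned per document.
public class BodyTextFilter : RenderFilter
{
    private readonly float footerTopY;
    private readonly float minHeight;

    public BodyTextFilter(float footerTopY, float minHeight)
    {
        this.footerTopY = footerTopY;
        this.minHeight = minHeight;
    }

    public override bool AllowText(TextRenderInfo renderInfo)
    {
        // Vertical position of the start of the chunk's baseline.
        float baselineY = renderInfo.GetBaseline().GetStartPoint()[Vector.I2];

        // Ascent-to-descent distance as a stand-in for the font size
        // (an approximation, not the nominal size from the content stream).
        float ascentY = renderInfo.GetAscentLine().GetStartPoint()[Vector.I2];
        float descentY = renderInfo.GetDescentLine().GetStartPoint()[Vector.I2];
        float height = ascentY - descentY;

        return baselineY >= footerTopY && height >= minHeight;
    }
}
```

An instance such as new BodyTextFilter(60f, 9f) would then be passed to FilteredTextRenderListener together with the wrapped strategy. The numbers are placeholders; logging the computed height for a few known body and footer chunks is one way to find values that separate the two.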

Upvotes: 1
