user2963585

Reputation: 63

PDF content search in ASP.NET C#

My requirement is to search PDF files by their content.

I have a folder with a lot of PDF files. I would like to develop an ASP.NET application that lets the user search those PDFs for text they enter in a textbox.

How can I do this? Thank you in advance.

Upvotes: 1

Views: 7301

Answers (3)

Bobrovsky

Reputation: 14236

Your task can be split into the following subtasks:

  1. Develop an indexer that will index all of your PDF files
  2. Develop the code that locates the relevant PDFs whenever a search is performed (using the index, of course)
  3. Develop the functionality that opens a relevant PDF, or shows a warning if nothing was found

To build the index, you can use an integrated solution such as Apache Lucene or Lucene.Net, or convert each PDF into text and build the index from that text yourself.

You may try the Docotic.Pdf library for the indexer part (disclaimer: I work for Bit Miracle).

The library could be used to extract text from PDFs. It can extract text with or without formatting. The extracted text can be used to create an index.
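For the build-it-yourself route, here is a minimal sketch of an in-memory inverted index. It assumes the text has already been extracted from each PDF by whichever library you choose. The class name `SimpleTextIndex` and its methods are hypothetical, not part of Docotic.Pdf or Lucene.Net, and it matches whole words only (no phrases, stemming, or ranking), so treat it as an illustration rather than a replacement for a real search engine.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: a tiny in-memory inverted index (word -> file names).
public class SimpleTextIndex
{
    // Case-insensitive map from each word to the files that contain it.
    private readonly Dictionary<string, HashSet<string>> _postings =
        new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);

    private static readonly char[] Separators =
        { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '(', ')', '"' };

    // Register the text that was already extracted from one PDF.
    public void Add(string fileName, string extractedText)
    {
        foreach (string word in Tokenize(extractedText))
        {
            HashSet<string> files;
            if (!_postings.TryGetValue(word, out files))
            {
                files = new HashSet<string>();
                _postings[word] = files;
            }
            files.Add(fileName);
        }
    }

    // Return the files that contain every word of the query, sorted by name.
    public IList<string> Search(string query)
    {
        IEnumerable<string> result = null;
        foreach (string word in Tokenize(query))
        {
            HashSet<string> files;
            if (!_postings.TryGetValue(word, out files))
                return new List<string>(); // one word is missing everywhere
            result = result == null ? files : result.Intersect(files);
        }
        return result == null
            ? new List<string>()
            : result.OrderBy(f => f, StringComparer.Ordinal).ToList();
    }

    private static IEnumerable<string> Tokenize(string text)
    {
        return (text ?? string.Empty)
            .Split(Separators, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

You would call `Add` once per PDF when building the index and `Search` with the textbox value at query time. An empty query returns no files.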

The library can also retrieve a collection of words with their bounding rectangles from PDFs. This might be useful if you need to know the exact position of some text in a file.

If you don't want to build an index, you can still use Docotic.Pdf to perform searches with code like the following:

// Requires the Docotic.Pdf library (BitMiracle.Docotic.Pdf namespace).
using (PdfDocument doc = new PdfDocument("file.pdf"))
{
    string textToSearch = "some text";
    for (int i = 0; i < doc.Pages.Count; i++)
    {
        string pageText = doc.Pages[i].GetText();

        // Count case-insensitive occurrences of the search text on this page.
        int count = 0;
        int lastStartIndex = pageText.IndexOf(textToSearch, 0, StringComparison.CurrentCultureIgnoreCase);
        while (lastStartIndex != -1)
        {
            count++;
            lastStartIndex = pageText.IndexOf(textToSearch, lastStartIndex + 1, StringComparison.CurrentCultureIgnoreCase);
        }

        // Pages are indexed from 0, so report i + 1 as the page number.
        if (count != 0)
            Console.WriteLine("Page {0}: '{1}' found {2} times", i + 1, textToSearch, count);
    }
}

Upvotes: 1

kwiri

Reputation: 1419

Try Zoom Search. It has a plugin that extracts the text of PDF documents so you can search against it, and the search is easy to customize. You will need the Standard edition, which is not free (about $49).

Zoom Search does the searching for you out of the box, so you avoid the more involved routes: extracting the text from the PDFs and indexing it in a database yourself, or using the Lucene search engine, which you would have to implement and customize (a bit of work). Zoom works well with ASP.NET, and you customize the search through its GUI, so not much coding is required.

Upvotes: 0

CloudyMarble

Reputation: 37566

You can use any PDF library for that; try iTextSharp, which is free.

You can read a PDF as text like this:

// Requires: using System.IO; using System.Text;
//           using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
public string ReadPdfFile(string fileName)
{
    StringBuilder text = new StringBuilder();

    if (File.Exists(fileName))
    {
        PdfReader pdfReader = new PdfReader(fileName);
        try
        {
            // iTextSharp page numbers start at 1.
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                // currentText is already a .NET string; round-tripping it through
                // Encoding.Default would only lose characters, so append it as-is.
                text.Append(currentText);
            }
        }
        finally
        {
            // Release the file even if extraction throws.
            pdfReader.Close();
        }
    }
    return text.ToString();
}
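To connect this to the original question, here is a minimal sketch of a folder search built on top of a per-file extractor such as `ReadPdfFile`. The class and method names (`PdfFolderSearch`, `FindFilesContaining`) are hypothetical, and the extractor is passed in as a delegate so the sketch does not depend on any particular PDF library. It re-reads every file on each search, which is only reasonable for small folders; an index is the better fit for large ones.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: scans a folder and returns the PDFs whose text contains the query.
public static class PdfFolderSearch
{
    // extractText is any function that turns a PDF path into plain text,
    // for example the ReadPdfFile method above (assumption: it is accessible here).
    public static List<string> FindFilesContaining(
        string folder, string query, Func<string, string> extractText)
    {
        var matches = new List<string>();
        if (string.IsNullOrWhiteSpace(query))
            return matches; // nothing to search for

        foreach (string path in Directory.EnumerateFiles(folder, "*.pdf"))
        {
            string text = extractText(path) ?? string.Empty;

            // Case-insensitive substring match on the extracted text.
            if (text.IndexOf(query, StringComparison.OrdinalIgnoreCase) >= 0)
                matches.Add(path);
        }
        return matches;
    }
}
```

In an ASP.NET page you might call it from a button click handler, passing the textbox value as `query` and binding the returned paths to a list control.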

Upvotes: 0
