Reputation: 63
My requirement is to search PDF files by their content.
I have a folder with a lot of PDF files. I would like to develop an ASP.NET application that lets the user search those PDFs for text entered in a textbox.
How can I do this? Thank you in advance.
Upvotes: 1
Views: 7301
Reputation: 14236
Your task may be split into the following subtasks:
To build the index, you can use an integrated solution such as Apache Lucene or Lucene.Net, or convert each PDF to text and build the index from that text yourself.
You may try the Docotic.Pdf library for the indexer part (disclaimer: I work for Bit Miracle).
The library can extract text from PDFs, with or without formatting. The extracted text can then be used to create an index.
The library can also retrieve a collection of words with their bounding rectangles from PDFs. This is useful if you need to know the exact position of a piece of text in a file.
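To make the "build the index from the text yourself" option more concrete, here is a minimal sketch of an in-memory inverted index. It assumes the text of each PDF has already been extracted by some library. The class name SimpleTextIndex and its methods are hypothetical and not part of Docotic.Pdf or Lucene.Net. It matches whole words with AND semantics and does not do phrase search, stemming, or ranking, all of which a real engine such as Lucene.Net would handle for you.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper (not part of any PDF or search library):
// a tiny in-memory inverted index over already-extracted text.
public sealed class SimpleTextIndex
{
    // Maps a lowercased word to the set of document names that contain it.
    private readonly Dictionary<string, HashSet<string>> _postings =
        new Dictionary<string, HashSet<string>>(StringComparer.Ordinal);

    // Splits text into lowercased words; any non-letter, non-digit is a separator.
    private static IEnumerable<string> Tokenize(string text)
    {
        var word = new StringBuilder();
        foreach (char c in text)
        {
            if (char.IsLetterOrDigit(c))
            {
                word.Append(char.ToLowerInvariant(c));
            }
            else if (word.Length > 0)
            {
                yield return word.ToString();
                word.Clear();
            }
        }
        if (word.Length > 0)
            yield return word.ToString();
    }

    // Adds the extracted text of one document to the index.
    public void Add(string documentName, string text)
    {
        foreach (string word in Tokenize(text))
        {
            if (!_postings.TryGetValue(word, out HashSet<string> docs))
            {
                docs = new HashSet<string>();
                _postings[word] = docs;
            }
            docs.Add(documentName);
        }
    }

    // Returns the sorted names of documents containing every word of the query.
    public List<string> Search(string query)
    {
        HashSet<string> result = null;
        foreach (string word in Tokenize(query))
        {
            if (!_postings.TryGetValue(word, out HashSet<string> docs))
                return new List<string>(); // one word is missing everywhere
            if (result == null)
                result = new HashSet<string>(docs);
            else
                result.IntersectWith(docs);
        }
        return result == null
            ? new List<string>()
            : result.OrderBy(d => d, StringComparer.Ordinal).ToList();
    }
}
```

You would call Add once per PDF when the folder is scanned, then call Search with the textbox value on each request. The index lives only in memory here, so it has to be rebuilt when the application restarts or when the folder contents change.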
If you don't want to build an index, you can still use Docotic.Pdf to search each file directly, with code like the following:
// Requires: using System; using BitMiracle.Docotic.Pdf;
using (PdfDocument doc = new PdfDocument("file.pdf"))
{
    string textToSearch = "some text";
    for (int i = 0; i < doc.Pages.Count; i++)
    {
        string pageText = doc.Pages[i].GetText();

        // Count case-insensitive occurrences of the search text on this page.
        int count = 0;
        int lastStartIndex = pageText.IndexOf(textToSearch, 0, StringComparison.CurrentCultureIgnoreCase);
        while (lastStartIndex != -1)
        {
            count++;
            lastStartIndex = pageText.IndexOf(textToSearch, lastStartIndex + 1, StringComparison.CurrentCultureIgnoreCase);
        }

        // Pages are indexed from 0, so report i + 1 as the page number.
        if (count != 0)
            Console.WriteLine("Page {0}: '{1}' found {2} times", i + 1, textToSearch, count);
    }
}
Upvotes: 1
Reputation: 1419
Try Zoom Search. It has a plugin that extracts the text of PDF documents (which you can then search against), and the search is easy to customize. You will need the Standard edition, which is not free (about $49). Zoom Search does the searching for you out of the box, so you don't need to do anything complicated, such as extracting the text from the PDFs yourself and indexing it in a database, or using the Lucene search engine, which requires you to implement and customize things yourself (a bit of work). Zoom works well with ASP.NET, and you only need its GUI to customize the search (not much coding is required).
Upvotes: 0
Reputation: 37566
You can use any PDF library for that. Try iTextSharp; it's a free one.
You can read a PDF as text like this:
// Requires: using System.IO; using System.Text;
// using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
public string ReadPdfFile(string fileName)
{
    StringBuilder text = new StringBuilder();
    if (File.Exists(fileName))
    {
        PdfReader pdfReader = new PdfReader(fileName);
        try
        {
            // iTextSharp page numbers start at 1.
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                // GetTextFromPage returns a .NET string, so no encoding conversion is needed.
                // AppendLine keeps a line break between pages.
                text.AppendLine(PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy));
            }
        }
        finally
        {
            pdfReader.Close(); // release the file even if extraction throws
        }
    }
    return text.ToString();
}
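To connect this back to the original question (searching a whole folder from a textbox value), a rough sketch of the remaining step might look like the following. The class PdfFolderSearch and its FindFiles method are hypothetical names introduced here for illustration. The text extractor is passed in as a delegate, on the assumption that a method such as ReadPdfFile above is supplied by the caller.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: scans a folder and returns the PDFs whose text contains the query.
public static class PdfFolderSearch
{
    // 'extractText' maps a file path to its text, e.g. the ReadPdfFile method above.
    public static List<string> FindFiles(string folder, string query, Func<string, string> extractText)
    {
        var matches = new List<string>();
        if (string.IsNullOrWhiteSpace(query))
            return matches; // nothing to search for

        foreach (string path in Directory.EnumerateFiles(folder, "*.pdf"))
        {
            string text = extractText(path) ?? string.Empty;
            // Case-insensitive substring match on the extracted text.
            if (text.IndexOf(query, StringComparison.OrdinalIgnoreCase) >= 0)
                matches.Add(path);
        }

        matches.Sort(StringComparer.Ordinal);
        return matches;
    }
}
```

From a page's code-behind you could call something like PdfFolderSearch.FindFiles(folderPath, searchTextBox.Text, ReadPdfFile), where folderPath and searchTextBox are placeholders for your own folder and control. Note that this re-reads every PDF on every search, so for a large folder you would want to cache the extracted text or build an index instead.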
Upvotes: 0