Majd

Reputation: 1368

Text indexing algorithm

I am writing a C# WinForms application for an archiving system. The system has a huge database in which some tables hold more than 1.5 million records. What I need is an algorithm that indexes the content of these records. The files are mainly Microsoft Office, PDF, and TXT documents. Can anyone help? Ideas, links, books, or code are all appreciated :)

Example: if I search for the word "international" in a certain folder in the database, I should get all the files that contain that word, ordered by some criterion such as relevance or modification date.

Upvotes: 11

Views: 5814

Answers (3)

user372724

I would partition the tables, index them on their IDs, and then perform the search.

Upvotes: 0

Kibbee

Reputation: 66132

It looks like you need two things. First, you need a system that actually performs the indexing. For this, you can go with Lucene or Apache Solr, as Mikos mentioned. You might also want to check out Sphinx, which is another full-text search engine. You could also use the full-text features built into your database: both SQL Server and MySQL have full-text indexing capabilities, as do many other databases.

The second thing you need is a way to get the text out of the files. For things like TXT and HTML files this is easy, because most full-text search engines will accept them as regular text. For more complicated binary documents like MS Word or PDF, you'll have to find another way to get the text out of them.
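The text-extraction step can be sketched as a small dispatch table keyed on file extension. This is only an illustration in Python, not part of any library: `EXTRACTORS`, `read_plain_text`, and `extract_text` are hypothetical names, and only plain text is handled. Extractors for Word or PDF files would have to be registered separately and are deliberately left out.

```python
from pathlib import Path
from typing import Callable, Dict


def read_plain_text(path: Path) -> str:
    """Plain-text files can be handed to the indexer as they are."""
    return path.read_text(encoding="utf-8", errors="replace")


# Hypothetical registry: file extension -> function returning the document text.
# Binary formats (.doc, .pdf, ...) need an external extractor registered here.
EXTRACTORS: Dict[str, Callable[[Path], str]] = {
    ".txt": read_plain_text,
}


def extract_text(path: Path) -> str:
    """Return the indexable text of a file, or fail loudly if the type is unknown."""
    extractor = EXTRACTORS.get(path.suffix.lower())
    if extractor is None:
        raise ValueError(f"no text extractor registered for {path.suffix!r}")
    return extractor(path)
```

Failing loudly on an unknown extension makes it obvious which file types still lack an extractor, rather than silently indexing nothing for them.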

Upvotes: 2

Mikos

Reputation: 8553

You need to create what is known as an inverted index, which is at the core of how search engines (such as Google) work. Apache Lucene is arguably the best library for inverted indexing. You have 2 options:

  1. Lucene.NET - a .NET port of the Java Lucene library.

  2. Apache Solr - a full-fledged search server built on the Lucene libraries and easy to integrate into your .NET application because it has a RESTful API. It comes out of the box with several features such as caching, scaling, and spell-checking. You can simplify your app-to-Solr interaction with the excellent SolrNet library.

For getting the text out of the documents, Apache Tika offers a very extensive data/metadata extraction toolkit that works with PDFs, HTML, MS Office docs, etc. A simpler option would be to use the IFilter API. See this article for more details.
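To make the idea of an inverted index concrete, here is a minimal sketch. It is written in Python only for brevity and is not Lucene's API; `InvertedIndex` and `tokenize` are illustrative names, and the ranking (summed term frequency) is a deliberately crude stand-in for real relevance scoring. The same structure ports directly to a C# `Dictionary`.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return TOKEN_RE.findall(text.lower())


class InvertedIndex:
    def __init__(self):
        # term -> {doc_id: number of times the term occurs in that document}
        self.postings = defaultdict(dict)

    def add(self, doc_id, text):
        """Record every term of `text` under `doc_id`."""
        for term in tokenize(text):
            counts = self.postings[term]
            counts[doc_id] = counts.get(doc_id, 0) + 1

    def search(self, query):
        """Return ids of documents containing all query terms, best match first."""
        terms = tokenize(query)
        if not terms:
            return []
        matches = None
        for term in terms:
            docs = set(self.postings.get(term, {}))
            matches = docs if matches is None else matches & docs
        # Crude relevance: total occurrences of the query terms in the document.
        scored = [(sum(self.postings[t][d] for t in terms), d) for d in matches]
        return [d for _, d in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]
```

A lookup touches only the posting lists of the query terms, never the documents themselves, which is why this structure scales to millions of records. Sorting by modification date instead would just mean storing that date alongside each `doc_id` and changing the sort key.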

Upvotes: 12
