Reputation: 13335
I'm trying to create a search engine for all literature (books, articles, etc), music, and videos relating to a particular spiritual group. When a keyword is entered, I want to display a link to all the PDF articles where the keyword appears, and also all the music files and video files which are tagged with the keyword in question. The user should be able to filter it with information such as author/artist, place, date/time, etc. When the user clicks on one of the results links (book names, for instance), they are taken to another page where snippets from that book everywhere the keyword is found are displayed.
I thought of using the Lucene library (or Searcharoo) to implement my PDF search, but I also need a database to tag all the other information so that results can be filtered by author/artist information, etc. So I was thinking of having tables for Text, Music, and Videos, and a field containing the path to the file for each. When a keyword is entered, I need to search the DB for music and video files, and also need to search the PDF's, and when a filter is applied, the music and video search is easy, but limiting the text search based on the filters is getting confusing.
Is my approach correct? Are there better ways to do this? Since the search content is limited only to the spiritual group, there is not an infinite number of items to search. I'd say about 100-500 books and 1000-5000 songs.
Upvotes: 3
Views: 2210
Reputation: 4786
Lucene is a great way to get up and running quickly without too much effort, along with several areas for extending the indexing and searching functionality to better suit your needs. It also has several built-in analyzers for common file types, such as HTML/XML, PDF, MS Word Documents, etc.
It provides the ability to use a variety of Fields, and they don't necessarily have to be uniform across all Documents (in other words, music files might have different attributes than text-based content, such as artist, title, length, etc.), which is great for storing different types of content.
Not knowing the exact implementation of what you're working on, this may or may not be feasible, but for tagging and other related features, you might also consider using a database, such as MySQL or SQL Server side-by-side with the Lucene index. Use the Lucene index for full-text search, then once you have a result set, go to the database to extract all the relational content. Our company has done this before, and it's actually not as big of a headache as it sounds.
NOTE: If you decide to go this route, BE CAREFUL, as the "unique id" provided by Lucene is highly volatile (it changes everytime the index is optimized), so you will want to store the actual id (the primary key in the database) as a separate field on the Document.
Another added benefit, if you are set on using C#.NET, there is a port called Lucene.Net, which is written entirely in C#. The down-side here is that you're a few months behind on all the latest features, but if you really need them, you can always check out the Java source and implement the required updates manually.
Upvotes: 6
Reputation: 25339
If you definitely want to go the database route then you should use SQL Server with Full Text Search enabled. You can use this with Express versions, too. You can then store and search the contents of PDFs very easily (so long as you install the free Adobe PDF iFilter).
Upvotes: 1
Reputation: 65391
You could try using MS Search Server Express Edition, one of the major benefits is that it is free.
http://www.microsoft.com/enterprisesearch/en/us/search-server-express.aspx#none
Upvotes: 1