Reputation: 281
I am trying to create a search engine just to learn and get more experience in Java.
My intention is to store about 100 files on a server, a mixture of HTML, XML, DOC, and TXT, each with its own metadata.
So when I search for a keyword, it should display each matching file with its meta description, like Google does.
My question is: apart from HTML, can you add metadata to other file formats, so that the meta description is shown?
Could you point me towards a Java search engine that can search within file formats (txt, html) and display the results?
I am working on my own code for this, but would like to look at other people's code for some help.
Upvotes: 11
Views: 28073
Reputation: 160271
Lucene is the canonical Java search engine.
For adding documents from a variety of sources, take a look at Apache Tika, and for a full-blown system with service/web interfaces, Solr.
Lucene allows arbitrary metadata to be associated with its documents. Tika will automatically cull metadata from a variety of formats.
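To make the idea concrete, here is a toy sketch of what "index the text, keep metadata alongside each document" means. It is not Lucene's API: the ToyIndex and Doc names and the path/description fields are hypothetical, and it only does exact whole-word matching with no ranking. It might still help when reading Lucene's own code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy inverted index. NOT Lucene's API, only an illustration of the idea.
class ToyIndex {
    // Hypothetical document shape: stored metadata (path, description) plus indexed text.
    static final class Doc {
        final String path;
        final String description;
        final String content;

        Doc(String path, String description, String content) {
            this.path = path;
            this.description = description;
            this.content = content;
        }
    }

    private final List<Doc> docs = new ArrayList<>();
    // term -> ids (positions in docs) of the documents whose content contains it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    void add(Doc doc) {
        int id = docs.size();
        docs.add(doc);
        for (String term : tokenize(doc.content)) {
            postings.computeIfAbsent(term, t -> new LinkedHashSet<>()).add(id);
        }
    }

    // Returns documents whose content contains the keyword as a whole word (case-insensitive).
    List<Doc> search(String keyword) {
        String term = keyword.toLowerCase(Locale.ROOT).trim();
        List<Doc> hits = new ArrayList<>();
        for (int id : postings.getOrDefault(term, Collections.<Integer>emptySet())) {
            hits.add(docs.get(id));
        }
        return hits;
    }

    // Lowercases the text and splits on anything that is not a letter or digit.
    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add(new Doc("notes.txt", "Notes on search libraries", "Lucene is a search library"));
        index.add(new Doc("bread.html", "A baking page", "How to bake bread"));
        for (Doc hit : index.search("lucene")) {
            System.out.println(hit.path + ": " + hit.description);
        }
    }
}
```

The description is only stored and returned with each hit, never searched. That mirrors the Google-style result list from the question, where the match comes from the content and the displayed snippet comes from the metadata.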
Upvotes: 27
Reputation: 1757
You'll have to use several libraries. First of all, as others have mentioned, you can use Lucene to do the actual searching. However, Lucene only handles plain text, so you need to extract the text from the files you index. For this, you could use Apache Tika.
To get started, you should probably buy the book Lucene in Action, 2nd edition. Most of the examples in there are still up to date. If you want to be a cheapskate, you could also just look at the provided source code on that page.
Upvotes: 3
Reputation: 11475
Use Apache Tika to extract metadata.
Apache Tika: The Apache Tika toolkit is an ASFv2-licensed open source tool for extracting information from digital documents. Tika allows search engines, content management systems and other applications that work with various kinds of digital documents to easily detect and extract metadata and content from all major file formats.
Upvotes: 2
Reputation: 7576
... Lucene and Solr come to mind as far as other people's code is concerned.
Upvotes: 3
Reputation: 85536
Upvotes: 3
Reputation: 25150
Look at Apache Nutch.
Apache Nutch is an open source web-search software project.
Nutch builds on Lucene/Solr for indexing and Tika for parsing documents, and adds its own web crawler.
Upvotes: 4
Reputation: 23644
Lucene is really good. There are a lot of plugins (which would let you read from .doc files, for example), support for multiple languages, and a lot of algorithms (like Levenshtein distance).
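For anyone unfamiliar with Levenshtein distance: it counts the single-character insertions, deletions, and substitutions needed to turn one string into another, which is what makes fuzzy matching of misspelled keywords possible. Below is a plain textbook dynamic-programming sketch, not Lucene's implementation; the class and method names are made up for illustration.

```java
// Textbook Levenshtein distance using two rolling rows. Not Lucene's code.
class Levenshtein {
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning the empty prefix of a into the first j chars of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            // Turning the first i chars of a into the empty string takes i deletions.
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            // Reuse the two arrays instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(Levenshtein.distance("kitten", "sitting"));
    }
}
```

A search front end could use a small distance threshold (say 1 or 2, an arbitrary choice here) to treat a query term as matching an indexed term despite a typo.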
Upvotes: 3
Reputation: 88747
1) My question is: apart from HTML, can you add metadata to any other file formats, so that the meta description is shown?
In general, you would use a database and store the metadata there along with the document. You'd then do a keyword search using a database query (possibly using SQL LIKE or ILIKE).
The files might either be stored on the hard drive with just their paths in the DB, or be put into the database as a CLOB or BLOB, depending on whether you have text or binary documents.
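As a rough sketch of the "paths plus metadata in a table" variant, the class below keeps the rows in memory and mimics a case-insensitive substring match. The MetadataCatalog and Entry names, the `documents` table, and its columns are all hypothetical; with a real database you would run the query from the comment through JDBC instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// In-memory stand-in for a metadata table. All names here are hypothetical.
class MetadataCatalog {
    // One row: where the file lives on disk and its meta description.
    static final class Entry {
        final String path;
        final String description;

        Entry(String path, String description) {
            this.path = path;
            this.description = description;
        }
    }

    private final List<Entry> rows = new ArrayList<>();

    void insert(String path, String description) {
        rows.add(new Entry(path, description));
    }

    // Roughly what this SQL would do (ILIKE is not available in every database):
    //   SELECT path, description FROM documents WHERE description ILIKE '%keyword%'
    List<Entry> findByKeyword(String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        List<Entry> matches = new ArrayList<>();
        for (Entry row : rows) {
            if (row.description.toLowerCase(Locale.ROOT).contains(needle)) {
                matches.add(row);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        MetadataCatalog catalog = new MetadataCatalog();
        catalog.insert("/srv/files/report.doc", "Quarterly sales report");
        catalog.insert("/srv/files/readme.txt", "Setup instructions");
        for (Entry e : catalog.findByKeyword("REPORT")) {
            System.out.println(e.path + ": " + e.description);
        }
    }
}
```

Note that this only searches the metadata, not the file contents; substring matching like this scans every row, which is fine for about 100 files but is the reason a dedicated index is usually preferred for larger collections.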
2) Would you be able to point me towards a Java search engine that can search within file formats (txt, html) and display the results?
Try Apache Lucene.
Upvotes: 5