Reputation: 33
I would like to index 1 million HTML documents in Lucene. I need to index several HTML files in one Lucene document. Later, I would like the search response to tell me which original HTML document matched.
So, for example I have:
1.home.html
2.history.html
3.about.html
4.home2.html
...
I want to index 1, 2 and 3 in the same Lucene document. Then, when I search for any text, I want to know which original document (home, history or about) it came from.
I have been searching on the Internet and found Lucene payloads, so I have been thinking about storing the URL of the original document as a payload on every term. Is this a good solution? Would the performance be all right?
Thanks very much for your help.
Upvotes: 0
Views: 2156
Reputation: 333
Grouping and faceting are two different Lucene features:
1. Grouping: it groups search results by a specified field. For example, if you group by the author field, then all documents with the same value in the author field fall into a single group. The output is a kind of tree.
2. Facet: this feature doesn't group documents, it just tells you how many documents fall under each value of a facet. For example, if you have a facet based on the author field, you will receive a list of all your authors, and for each author you will know how many documents belong to that author. Afterwards, if you want to see those documents, you have to query once more, adding a specific filter (author=whatever). Faceted search is in fact based on browsing documents by applying multiple filters, to progressively reach the documents you're really interested in.
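To make the difference concrete, here is a minimal sketch in plain Java. It does not use the Lucene grouping or facet API; the `Hit` class, the titles and the author values are hypothetical stand-ins for search results. Grouping keeps the documents under each author, while faceting keeps only a count per author.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; this is not the Lucene grouping or facet API.
class GroupingVsFaceting {

    // Hypothetical stand-in for a search hit: a title plus its author field.
    static final class Hit {
        final String title;
        final String author;

        Hit(String title, String author) {
            this.title = title;
            this.author = author;
        }
    }

    // Grouping: each author maps to the titles of the hits themselves.
    static Map<String, List<String>> group(List<Hit> hits) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Hit h : hits) {
            groups.computeIfAbsent(h.author, k -> new ArrayList<>()).add(h.title);
        }
        return groups;
    }

    // Faceting: each author maps only to the number of matching hits.
    static Map<String, Integer> facet(List<Hit> hits) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Hit h : hits) {
            counts.merge(h.author, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
            new Hit("Doc A", "alice"),
            new Hit("Doc B", "bob"),
            new Hit("Doc C", "alice"));
        System.out.println(group(hits)); // {alice=[Doc A, Doc C], bob=[Doc B]}
        System.out.println(facet(hits)); // {alice=2, bob=1}
    }
}
```

With facets, seeing the documents behind `alice=2` would take a second query filtered on that author, as described above.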
Here are some tutorials:
http://lucene.apache.org/core/4_3_1/facet/org/apache/lucene/facet/doc-files/userguide.html
http://lucene.apache.org/core/4_3_1/facet/org/apache/lucene/facet/search/package-summary.html
Just go through them and work out what fits your needs.
Upvotes: 0
Reputation: 33
I have been working two days on this problem and I think I found the solution.
I index every HTML page as its own Lucene document and give each one a shared group ID, for example:
1.home.html id1 htmlcontent
2.history.html id1 htmlcontent
3.about.html id1 htmlcontent
4.home2.html id2 htmlcontent
...
Later I can use org.apache.lucene.search.grouping to group the results by this ID.
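The idea can be sketched in plain Java, without the Lucene API: each page is its own entry tagged with a group ID, and matching pages are collected per ID, so the original file name is never lost. The `Page` class, the `search` method and the sample page text are hypothetical; in a real index the grouping step would be done by the Lucene grouping package over the ID field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java illustration of the scheme; not the Lucene grouping API.
class PageGroupingSketch {

    // One entry per HTML page, tagged with the ID of the group it belongs to.
    static final class Page {
        final String file;
        final String groupId;
        final String content; // hypothetical sample text

        Page(String file, String groupId, String content) {
            this.file = file;
            this.groupId = groupId;
            this.content = content;
        }
    }

    // Returns group ID -> files in that group whose content contains the term
    // (case-insensitive substring match, a stand-in for a real query).
    static Map<String, List<String>> search(List<Page> pages, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (Page p : pages) {
            if (p.content.toLowerCase(Locale.ROOT).contains(needle)) {
                grouped.computeIfAbsent(p.groupId, k -> new ArrayList<>()).add(p.file);
            }
        }
        return grouped;
    }

    static List<Page> samplePages() {
        return Arrays.asList(
            new Page("home.html", "id1", "Welcome to our site"),
            new Page("history.html", "id1", "Our company was founded long ago"),
            new Page("about.html", "id1", "About our company"),
            new Page("home2.html", "id2", "Welcome to another site"));
    }

    public static void main(String[] args) {
        System.out.println(search(samplePages(), "company")); // {id1=[history.html, about.html]}
        System.out.println(search(samplePages(), "welcome")); // {id1=[home.html], id2=[home2.html]}
    }
}
```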
Hope this helps anybody :)
Upvotes: 1
Reputation: 1244
I think what you need is Apache Solr (http://lucene.apache.org/solr/). It uses Lucene as its indexing engine and has a querying / web interface for searching.
Look at this tutorial on the site: http://lucene.apache.org/solr/4_3_1/tutorial.html
Upvotes: 0