Reputation: 33

Lucene indexing html documents

I would like to index 1 million HTML documents in Lucene. I need to index several HTML files in a single Lucene document. Later, I would like the search response to tell me which original HTML file matched.

So, for example I have:

1.home.html
2.history.html
3.about.html

4.home2.html
...

I want to index 1, 2 and 3 in the same Lucene document. Then, when I search for any text, I want to know which original file matched (home, history or about).

I have been searching the Internet and found Lucene payloads. So I have been thinking about storing the URL of the original document as a payload on every term. Is this a good solution? Would the performance be all right?

Thanks very much for your help.

Upvotes: 0

Answers (3)

myk.

Reputation: 333

They are two different Lucene features:

1. Grouping: this lets you group search results by a specified field. For example, if you group by the author field, all documents with the same value in the author field fall into a single group. The output is a kind of tree.

http://lucene.apache.org/core/3_2_0/api/contrib-grouping/org/apache/lucene/search/grouping/package-summary.html

2. Faceting: this feature doesn't group documents; it just tells you how many documents have each value of a facet. For example, if you have a facet based on the author field, you will receive a list of all your authors, and for each author you will know how many documents belong to that author. Afterwards, if you want to see those documents, you have to query once more, adding a specific filter (author=whatever). Faceted search is based on browsing documents by applying multiple filters to progressively reach the documents you're really interested in.
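To make the difference concrete, here is a minimal plain-Java sketch. It is only an illustration and does not use the Lucene grouping or facet APIs; the class GroupingVsFacets, the Doc type and its author/title fields are hypothetical names made up for this example. Grouping keeps every hit and arranges it under its field value, while faceting only returns a count per value and needs a second, filtered pass to get the documents.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; this is NOT the Lucene grouping or facet API.
class GroupingVsFacets {

    // A toy search hit with a hypothetical author field and a title.
    static final class Doc {
        final String author;
        final String title;

        Doc(String author, String title) {
            this.author = author;
            this.title = title;
        }
    }

    // Grouping: every hit is kept, arranged under its author value.
    static Map<String, List<String>> groupByAuthor(List<Doc> hits) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Doc d : hits) {
            groups.computeIfAbsent(d.author, k -> new ArrayList<>()).add(d.title);
        }
        return groups;
    }

    // Faceting: only a count per author value is returned, no documents.
    static Map<String, Integer> facetCounts(List<Doc> hits) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Doc d : hits) {
            counts.merge(d.author, 1, Integer::sum);
        }
        return counts;
    }

    // Drill-down: the second pass, with the filter author=value applied.
    static List<String> drillDown(List<Doc> hits, String author) {
        List<String> titles = new ArrayList<>();
        for (Doc d : hits) {
            if (d.author.equals(author)) {
                titles.add(d.title);
            }
        }
        return titles;
    }
}
```

With real Lucene you would get the same two shapes of result from the grouping and facet modules; the exact classes depend on the Lucene version, so check the linked documentation for yours.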

Here are some tutorials:

http://lucene.apache.org/core/4_3_1/facet/org/apache/lucene/facet/doc-files/userguide.html

http://lucene.apache.org/core/4_3_1/facet/org/apache/lucene/facet/search/package-summary.html

Go through them and adapt them to your needs.

Upvotes: 0

Hibernator

Reputation: 33

I have been working on this problem for two days and I think I have found the solution.

I index every HTML page as its own Lucene document and give the pages that belong together a shared ID, for example:

1.home.html     id1  htmlcontent
2.history.html  id1  htmlcontent
3.about.html    id1  htmlcontent

4.home2.html    id2  htmlcontent
...

Later I can use org.apache.lucene.search.grouping to group the results by this ID.

http://lucene.apache.org/core/3_2_0/api/contrib-grouping/org/apache/lucene/search/grouping/package-summary.html
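The idea can be sketched in plain Java, without any Lucene calls, so treat it as an illustration of the data layout rather than working Lucene code. The class PageGroupingSketch, the Page type and its groupId/file/content fields are hypothetical names for this example, and the matching is a naive substring test instead of Lucene's tokenized search. Because every page is its own entry, each hit still carries its original file name, and the shared ID is only used to collect hits into groups.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java illustration of "one entry per page, grouped by a shared ID".
// This is NOT Lucene code; matching is a naive case-insensitive substring test.
class PageGroupingSketch {

    // One indexed HTML page: the shared group ID, the file name, and its text.
    static final class Page {
        final String groupId;
        final String file;
        final String content;

        Page(String groupId, String file, String content) {
            this.groupId = groupId;
            this.file = file;
            this.content = content;
        }
    }

    // Returns groupId -> names of the files in that group whose content
    // contains the term, so the original page of every hit stays known.
    static Map<String, List<String>> searchGrouped(List<Page> index, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Page p : index) {
            if (p.content.toLowerCase(Locale.ROOT).contains(needle)) {
                groups.computeIfAbsent(p.groupId, k -> new ArrayList<>()).add(p.file);
            }
        }
        return groups;
    }
}
```

In Lucene itself, the file name would go into a stored field of each document so that it can be read back from the hit, and the shared ID would be the field you group on.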

Hope this helps somebody :)

Upvotes: 1

Mehul Rathod

Reputation: 1244

I think what you need is Apache Solr (http://lucene.apache.org/solr/). It uses Lucene as its indexing engine and has a querying / web interface for searching.

Look at this tutorial on the site: http://lucene.apache.org/solr/4_3_1/tutorial.html

Upvotes: 0
