synack

Reputation: 1749

How to index a WEB TREC collection?

I've built a WEB TREC collection by downloading and parsing HTML pages myself. Each TREC file contains a Category field. How can I build an index with Lucene so that I can search that collection? The idea is that the search should return categories as results instead of documents.

Thank you!

Upvotes: 0

Views: 991

Answers (1)

Mikos

Reputation: 8553

This should be a relatively simple task, since you already have the pages in HTML format. You could index them in Lucene like this (Java-based pseudocode):

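for (File file : htmlfiles)
{
    Document d = new Document();
    // Store the category verbatim (not tokenized) so it can be read back from each hit.
    d.add(new Field("Category", GetCategoryName(...), Field.Store.YES, Field.Index.NOT_ANALYZED));
    // Analyze the page text so that it is searchable.
    d.add(new Field("Contents", GetContents(...), Field.Store.YES, Field.Index.ANALYZED));

    writer.addDocument(d);
}
// Close the writer once, after all documents have been added (not inside the loop).
writer.close();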

GetCategoryName(...) should return the category string and GetContents(...) the contents of the corresponding HTML file. It would be a good idea to strip the HTML tags from the contents before indexing. There are several ways of doing this, HtmlParser being one.
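To illustrate what stripping the tags means, here is a deliberately naive sketch that uses only regular expressions from the Java standard library. The class name NaiveHtmlText and the method extractText are made up for this example. It does not decode entities such as &amp;amp; and will trip over malformed markup, so a real HTML parser is the safer choice for crawled pages.

```java
import java.util.regex.Pattern;

public class NaiveHtmlText {
    // Drop <script> and <style> elements together with their bodies.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Any remaining tag, e.g. <p>, </b>, <a href="...">.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Hypothetical helper: returns the visible text of an HTML string.
    public static String extractText(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        // Collapse the runs of spaces left behind by the removed tags.
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(extractText("<h1>Title</h1><p>Hello <b>world</b></p>"));
    }
}
```

Something like GetContents(...) could call such a helper on the raw page before the text is handed to the Contents field.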

When you search, query the Contents field and iterate through the search results, collecting the stored Category value of each hit.
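The collection step itself does not depend on Lucene, so it can be sketched with plain Java collections. In this sketch the input list stands in for the Category values read from the hits in rank order. The class CategoryCollector and its method countCategories are hypothetical names, and how you read the stored field from each hit depends on your Lucene version.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CategoryCollector {
    // Counts how many hits fall into each category.
    // A LinkedHashMap keeps categories in the order they first appear,
    // so the category of the best-ranked hit comes first.
    public static Map<String, Integer> countCategories(List<String> hitCategories) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String category : hitCategories) {
            counts.merge(category, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed input: the Category value of each hit, best hit first.
        List<String> hits = List.of("sports", "news", "sports", "science");
        System.out.println(countCategories(hits));
    }
}
```

The keys of the returned map are the categories to show as results, and the counts give a rough idea of how strongly each category matched the query.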

If you want a list of categories with counts attached ("facets"), look into faceted search. Solr, a search server built on Lucene, provides this out of the box.

Upvotes: 1
