Reputation: 17527
Questions like this have been asked many times (e.g. here, here, here, ...), and my inability to get what I need from those answers may just come from not understanding what Lucene means by "term" or "termdoc".
I build a Lucene index thus:
var db = new DataClassesDataContext();
var articles = (from article in db.Articles
                orderby article.articleID ascending
                select article).ToList();
var analyzer = new StandardAnalyzer(Version.LUCENE_30);
using (var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
{
    foreach (var article in articles)
    {
        var luceneDocument = new Document();
        luceneDocument.Add(new Field("ArticleID", article.articleID.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        luceneDocument.Add(new Field("Title", article.title, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
        luceneDocument.Add(new Field("Paragraph", article.paragraph, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
        writer.AddDocument(luceneDocument);
    }
    Console.WriteLine("Optimizing index.");
    writer.Optimize();
}
This works well, and I can retrieve any term frequency vector. For example,
var titleVector = indexReader.GetTermFreqVector(5001, "Title");
gives the result {Title: doing/1, healthcare/1, right/1}.
But I would like to enumerate the inverted index, which maps each word (like "doing", "healthcare", and "right") to the IDs of the documents whose titles contain it. I would like to build a CSV file where each row looks something like: word, ArticleID_1, ArticleID_2, ... , ArticleID_n
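To make the target format concrete, here is a minimal in-memory sketch of the mapping and the CSV row, independent of Lucene. The class and method names (`InvertedIndexSketch`, `Build`, `ToCsvRow`) are hypothetical, and it splits on spaces only, so it will not match what `StandardAnalyzer` produces (no stop-word removal or punctuation handling).

```csharp
using System;
using System.Collections.Generic;

public static class InvertedIndexSketch
{
    // Hypothetical helper: maps each lower-cased word to the sorted set of
    // article IDs whose title contains that word.
    public static SortedDictionary<string, SortedSet<int>> Build(
        IEnumerable<KeyValuePair<int, string>> titlesById)
    {
        var index = new SortedDictionary<string, SortedSet<int>>(StringComparer.Ordinal);
        foreach (var entry in titlesById)
        {
            // Naive whitespace tokenization (an assumption; not StandardAnalyzer).
            var words = entry.Value.ToLowerInvariant()
                .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            foreach (var word in words)
            {
                SortedSet<int> ids;
                if (!index.TryGetValue(word, out ids))
                {
                    ids = new SortedSet<int>();
                    index[word] = ids;
                }
                ids.Add(entry.Key);
            }
        }
        return index;
    }

    // Formats one row as: "word", id_1, id_2, ..., id_n
    public static string ToCsvRow(string word, IEnumerable<int> ids)
    {
        return "\"" + word + "\", " + string.Join(", ", ids);
    }
}
```

With two made-up titles, "Doing healthcare right" (ID 1) and "Healthcare costs" (ID 2), the row for "healthcare" would list both IDs.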
What I have so far does not do what I want: it prints the terms from every field, not just Title.
var terms = indexReader.Terms();
while (terms.Next())
{
    Console.WriteLine(terms.Term.Text);
}
How do I get the list of all words that the index is using as terms from the "Title" field in my documents? I.e. how do I restrict that last code snippet to Title field terms only?
Upvotes: 1
Views: 1402
Reputation: 17527
Typical: no sooner had I written down the question than an answer came to mind!
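var terms = indexReader.Terms();
while (terms.Next())
{
    // Terms() enumerates every field, so keep only the Title terms.
    if (terms.Term.Field == "Title")
    {
        var row = "\"" + terms.Term.Text + "\", ";
        // TermDocs lists the documents that contain this term.
        var termDocs = indexReader.TermDocs(terms.Term);
        while (termDocs.Next())
        {
            row += indexReader[termDocs.Doc].Get("ArticleID") + ", ";
        }
        // Strings are immutable: TrimEnd returns a new string, so assign it back.
        row = row.TrimEnd(new char[] { ',', ' ' });
        titleFile.WriteLine(row);
    }
}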
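A possible refinement, sketched under the assumption that this is Lucene.NET 3.0.x: because the term dictionary is ordered by field and then by text, `indexReader.Terms(new Term("Title", ""))` can seek straight to the first Title term, and the loop can stop as soon as the field changes instead of scanning every field. The class and method names (`TitleTermExporter`, `WriteTitleTerms`) are hypothetical, and `titleFile` is assumed to be a `TextWriter`.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Index;

public static class TitleTermExporter
{
    // Writes one row per Title term: "word", ArticleID_1, ..., ArticleID_n
    public static void WriteTitleTerms(IndexReader indexReader, TextWriter titleFile)
    {
        // Terms(Term) returns an enumerator already positioned ON the first
        // term >= the seed, so inspect Term before calling Next().
        using (var terms = indexReader.Terms(new Term("Title", "")))
        {
            do
            {
                var term = terms.Term;
                // Stop when the enumerator is exhausted or leaves the Title field.
                if (term == null || term.Field != "Title")
                {
                    break;
                }

                var ids = new List<string>();
                using (var termDocs = indexReader.TermDocs(term))
                {
                    while (termDocs.Next())
                    {
                        ids.Add(indexReader.Document(termDocs.Doc).Get("ArticleID"));
                    }
                }

                // string.Join avoids the trailing ", " that needed trimming.
                titleFile.WriteLine("\"" + term.Text + "\", " + string.Join(", ", ids));
            } while (terms.Next());
        }
    }
}
```

Collecting the IDs in a list and joining them also sidesteps the repeated string concatenation in the inner loop, which may matter for terms that occur in many documents.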
Upvotes: 1