Reputation: 49
I'm interested in using Solr to analyze documents and to obtain word frequencies across all documents matching a particular criterion.
I tried the TermVectorComponent, but I was only able to get term frequencies for individual documents, not totals over groups of documents.
For example, given the following data:
[
{
"id": "1",
"category": "cat1",
"includes": "The green car."
},
{
"id": "2",
"category": "cat1",
"includes": "The red car."
},
{
"id": "3",
"category": "cat2",
"includes": "The black car."
}
]
I'd like to be able to get total term frequency counts per category, e.g.:
<category name="cat1">
<lst name="the">2</lst>
<lst name="car">2</lst>
<lst name="green">1</lst>
<lst name="red">1</lst>
</category>
<category name="cat2">
<lst name="the">1</lst>
<lst name="car">1</lst>
<lst name="black">1</lst>
</category>
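To make the aggregation I'm after concrete, here is a minimal client-side sketch in Python. It is not a Solr solution; it only illustrates the counts I want Solr to return. The function name `term_counts_by_category` and the naive lowercase word-splitting are assumptions for illustration and will not match whatever analyzer the Solr field actually uses.

```python
import re
from collections import Counter, defaultdict

# Sample documents from the question.
docs = [
    {"id": "1", "category": "cat1", "includes": "The green car."},
    {"id": "2", "category": "cat1", "includes": "The red car."},
    {"id": "3", "category": "cat2", "includes": "The black car."},
]


def term_counts_by_category(documents, group_field="category", text_field="includes"):
    """Sum term frequencies over all documents sharing the same group value.

    Tokenization is a naive stand-in (lowercase, split on non-word characters),
    not a real Solr analysis chain.
    """
    totals = defaultdict(Counter)
    for doc in documents:
        tokens = re.findall(r"\w+", doc[text_field].lower())
        totals[doc[group_field]].update(tokens)
    return totals


counts = term_counts_by_category(docs)
for category, counter in sorted(counts.items()):
    print(category, dict(counter))
```

For the sample data this gives the=2, car=2, green=1, red=1 for cat1 and the=1, car=1, black=1 for cat2, which is the same result as the XML above.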
I tried using facets, but I was unable to get them to combine word counts for individual documents as shown above. I noticed that the TermVectorComponent can also report a term's document frequency across the entire index, but that is not useful to me: I need total frequency counts for just a subset of documents.
Does anyone have suggestions for how to get this information from Solr/Lucene?
Thanks in advance.
Upvotes: 1
Views: 671