David Larochelle

Reputation: 49

Getting total word frequencies for a subset of documents in Solr

I'm interested in using Solr to analyze documents and to obtain word frequencies for all documents matching a particular criterion.

I tried TermVectorComponent, but I was only able to get term frequencies for individual documents, not totals over groups of documents.
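For reference, this is roughly what I was requesting; a minimal sketch of the TermVectorComponent parameters (the URL and core name are placeholders):

```python
# Request per-document term frequencies from TermVectorComponent.
# tv, tv.tf, and tv.fl are standard TermVectorComponent parameters;
# the endpoint URL below is a placeholder for your own Solr core.
params = {
    "q": "category:cat1",   # the subset I care about
    "tv": "true",           # enable term vectors in the response
    "tv.tf": "true",        # include per-document term frequency
    "tv.fl": "includes",    # field to return term vectors for
    "rows": 10,
}

# e.g. with requests:
# requests.get("http://localhost:8983/solr/mycore/select", params=params)
print(params)
```

This returns a tf per term *per document*, which then still needs to be summed somewhere.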

For example given the following data:

  {
    "id": "1",
    "category": "cat1",
    "includes": "The green car.",
  },
  {
    "id": "2",
    "category": "cat1",
    "includes": "The red car.",
  },
  {
    "id": "3",
    "category": "cat2",
    "includes": "The black car.",
  }

I'd like to be able to get total term frequency counts per category, e.g.:

<category name="cat1">
   <lst name="the">2</lst>
   <lst name="car">2</lst>
   <lst name="green">1</lst>
   <lst name="red">1</lst>
</category>
<category name="cat2">
   <lst name="the">1</lst>
   <lst name="car">1</lst>
   <lst name="black">1</lst>
</category>

I tried using facets, but I was unable to get them to combine word counts across individual documents as shown above. I noticed that TermVectorComponent can give the document frequency for a term across the entire index, but this is not useful to me. I need total frequency counts for just subsets of documents.
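The fallback I'm trying to avoid is fetching tv.tf per document and summing client-side. A minimal sketch of that aggregation step, using the sample documents above and a naive tokenizer standing in for Solr's analysis chain:

```python
import re
from collections import Counter

# Sample documents from the question; in practice the per-document
# term frequencies would come from TermVectorComponent (tv.tf).
docs = [
    {"id": "1", "category": "cat1", "includes": "The green car."},
    {"id": "2", "category": "cat1", "includes": "The red car."},
    {"id": "3", "category": "cat2", "includes": "The black car."},
]

def tokenize(text):
    # Naive lowercase word tokenizer; Solr's analyzer would normally
    # do this server-side.
    return re.findall(r"[a-z]+", text.lower())

# Sum per-document term frequencies into per-category totals.
totals = {}
for doc in docs:
    counts = totals.setdefault(doc["category"], Counter())
    counts.update(tokenize(doc["includes"]))

print(totals["cat1"])  # Counter({'the': 2, 'car': 2, 'green': 1, 'red': 1})
print(totals["cat2"])  # Counter({'the': 1, 'black': 1, 'car': 1})
```

This works, but it pulls every document's term vector over the wire, which is what I was hoping Solr could aggregate for me.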

Does anyone have suggestions for how to get this information from Solr/Lucene?

Thanks in advance.

Upvotes: 1

Views: 671

Answers (1)

djm

Reputation: 342

I found a link suggesting you'll have to modify TermsComponent.java (SolrJ, perhaps?).

I've never tried it, but could you also use a function query (i.e. sum) to add up the tv.df values? There's a full list of function queries in the Solr documentation.
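If the function-query route works at all, a related candidate is Solr's termfreq(field, term) function summed over the matching subset via the StatsComponent. Below is a sketch of the request parameters only; this is untested, and both termfreq() and stats over a {!func} local param depend on your Solr version:

```python
# Hypothetical, untested request (assumptions: termfreq() function query
# and StatsComponent support for {!func} in your Solr version).
# It sums the term frequency of one known term ('car') over the
# documents matching the filter, which gives its total count in cat1.
params = {
    "q": "*:*",
    "fq": "category:cat1",   # restrict to the subset of interest
    "rows": 0,               # only the stats block is needed
    "stats": "true",
    "stats.field": "{!func}termfreq(includes,'car')",
}

# e.g. with requests:
# requests.get("http://localhost:8983/solr/mycore/select", params=params)
query_string = "&".join(f"{k}={v}" for k, v in params.items())
print(query_string)
```

The catch is that it needs a named term per request, so getting totals for *all* terms would mean one such sum per term (or the TermsComponent modification above).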

Upvotes: 0
