TheMadCapLaughs27

Reputation: 346

Filtering term count in Lucene (Java)

I'm currently trying to get the number of occurrences of each word in a description field using Lucene. For example:

output:

I am looking to get the word and the frequency.

The thing is, I would like to restrict those results to a given document; that is, count only the words in the description field of that one document.

Thanks for any assistance given.

In answer to the comment, I have something like this:

public ArrayList<ObjectA> GetIndexTerms(String code) {
        try {
            ArrayList<ObjectA> termlist = new ArrayList<ObjectA>();
            indexR = IndexReader.open(path);
            TermEnum terms = indexR.terms();

            // Walks every term in the whole index; "code" is not used yet.
            while (terms.next()) {
                Term term = terms.term();
                String termText = term.text();
                // docFreq is the number of documents that contain the term
                int frequency = indexR.docFreq(term);
                ObjectA newObj = new ObjectA(termText, frequency);
                termlist.add(newObj);
            }
            return termlist;
        } catch (Exception ex) {
            ex.printStackTrace();
            return null;
        }
}

But I don't see how to filter it by document...


Update:

Using the TermFreqVector I can get it to work, but it takes the doc id and I can't get that right. Since I used a query, the "i" value starts at 0 and that's not the proper doc id. Any ideas on how to get this working properly? Thanks!

    TopDocs tp = indexS.search(query, Integer.MAX_VALUE);
    for (int i = 0; i < tp.scoreDocs.length; i++) {
        ScoreDoc sds = tp.scoreDocs[i];
        Document doc = indexS.doc(sds.doc);
        TermFreqVector tfv = indexR.getTermFreqVector(i, "description");

        for (int j = 0; j < tfv.getTerms().length; j++) {
            String item = tfv.getTerms()[j];
            termlist.add(new TerminoDescripcion(item.toUpperCase(), tfv.getTermFrequencies()[j]));
        }
    }

Upvotes: 0

Views: 901

Answers (1)

jpountz

Reputation: 9964

The problem is that Lucene is an inverted index, meaning that it makes it easy to retrieve documents based on terms, whereas you are looking for the opposite, i.e. retrieving terms based on documents.

Fortunately, this is a recurring problem, and Lucene gives you the ability to retrieve the terms of a document (term vectors), provided that you enabled this feature at indexing time.

See TermVector.YES and the Field constructor for how to enable term vectors at indexing time, and IndexReader for how to retrieve them at search time.
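On the doc id issue in the question's second snippet: the loop counter "i" is only a position in the hit list, while the term vector has to be looked up by the hit's own document id. In Lucene 3.x that would be indexR.getTermFreqVector(sds.doc, "description") rather than passing "i". Here is a minimal sketch of the same idea with no Lucene dependency. The class, the method and the nested map standing in for stored term vectors are all hypothetical and only serve as illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class DocIdLookupSketch {
    // Sums term frequencies over the documents a search returned.
    // termVectors: doc id -> (term -> frequency), a toy stand-in for stored term vectors.
    // hitDocIds: the doc ids of the hits, in ranking order.
    static Map<String, Integer> termsForHits(Map<Integer, Map<String, Integer>> termVectors,
                                             int[] hitDocIds) {
        Map<String, Integer> totals = new TreeMap<>();
        for (int i = 0; i < hitDocIds.length; i++) {
            int docId = hitDocIds[i]; // the hit's doc id, not the loop counter i
            Map<String, Integer> vector = termVectors.get(docId);
            if (vector == null) {
                continue; // no term vector was stored for this document
            }
            for (Map.Entry<String, Integer> e : vector.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<Integer, Map<String, Integer>> termVectors = new HashMap<>();
        termVectors.put(0, Map.of("apple", 3));
        termVectors.put(5, Map.of("lucene", 2, "index", 1));
        // The only hit is document 5, so only its terms are counted.
        System.out.println(termsForHits(termVectors, new int[] {5}));
    }
}
```

With a single hit on document 5, looking up by "i" (which is 0) would have returned the terms of document 0 instead.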

Alternatively, you could re-analyze a stored field on the fly, but this may be slower, especially on large fields.
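A rough sketch of that second option, assuming you have already loaded the stored description text of the one document you care about. It does not use a Lucene analyzer: it lowercases the text and splits on anything that is not a letter or digit, so the counts can differ from what your index-time analyzer would produce (stop words, stemming and so on). The class and method names are made up for the example.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class DescriptionTermCounter {
    // Counts how often each token occurs in a single document's description text.
    static Map<String, Integer> countTerms(String description) {
        Map<String, Integer> counts = new TreeMap<>();
        // Simple tokenization: lowercase, then split on runs of non-letter, non-digit characters.
        String[] tokens = description.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split can yield an empty first element
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {cat=2, other=1, saw=1, the=2}
        System.out.println(countTerms("The cat saw the other cat"));
    }
}
```

To match the index exactly you would run the text through the same analyzer used at indexing time instead of the regular expression split.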

Upvotes: 2
