Is it possible to find common words in specific Lucene documents?

Question

For example:

doc1 = "I got the new Apple iPhone 8";
doc2 = "have you seen the new Apple iPhone 8?";
doc3 = "the Apple iPhone 8 is out";
doc4 = "another doc without the common words";

find_commons(["doc1", "doc2", "doc3", "doc4"]);

results: {{"doc1", "doc2", "doc3"}, {"Apple", "iPhone"}} or something similar

Other question: is there a better library/system to achieve this using Lucene's data?

Philipp Ludwig · Accepted Answer

Yes, you can use the TermVector to retrieve this information.

First, you need to make sure that the TermVectors are stored in the index, e.g.:

private static Document createDocument(String title, String content) {
    Document doc = new Document();

    // The title is indexed as a single token and stored for retrieval.
    doc.add(new StringField("title", title, Field.Store.YES));

    // The content field is tokenized and stores term vectors, so the
    // terms of each document can be read back later.
    FieldType type = new FieldType();
    type.setTokenized(true);
    type.setStoreTermVectors(true);
    type.setStored(false);
    type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
    doc.add(new Field("content", content, type));

    return doc;
}

Then, you can retrieve the term vector for a given document id:

private static List<String> getTermsForDoc(int docId, String field, IndexReader reader) throws IOException {
    List<String> result = new ArrayList<>();

    Terms terms = reader.getTermVector(docId, field);
    if (terms == null) {
        // No term vector was stored for this document and field.
        return result;
    }

    TermsEnum it = terms.iterator();
    for (BytesRef br = it.next(); br != null; br = it.next()) {
        result.add(br.utf8ToString());
    }

    return result;
}

Finally you can retrieve common terms for two documents:

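private static List<String> getCommonTerms(int docId1, int docId2, IndexSearcher searcher) throws IOException {
    // getTermsForDoc expects an IndexReader, so take it from the searcher.
    IndexReader reader = searcher.getIndexReader();

    // Using the field "content" is just an example here.
    List<String> termList1 = getTermsForDoc(docId1, "content", reader);
    List<String> termList2 = getTermsForDoc(docId2, "content", reader);

    // Keep only the terms that also occur in the second document.
    termList1.retainAll(termList2);
    return termList1;
}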

Of course this can easily be expanded to allow an arbitrary number of documents.
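
As a rough sketch of that extension, the intersection step can be written independently of Lucene. The class name CommonTerms, the method commonTerms and the sample term lists below are made up for illustration; in practice each inner list would come from something like getTermsForDoc. Keep in mind that the terms are whatever the analyzer produced at index time, so they may be lowercased and may include words such as "the" unless stop words are filtered out.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class CommonTerms {

    // Returns the terms that occur in every one of the given term lists.
    static Set<String> commonTerms(List<List<String>> termLists) {
        if (termLists.isEmpty()) {
            return new LinkedHashSet<>();
        }
        // Start from the first document's terms, then keep only those
        // that also appear in each of the remaining documents.
        Set<String> common = new LinkedHashSet<>(termLists.get(0));
        for (List<String> terms : termLists.subList(1, termLists.size())) {
            common.retainAll(new HashSet<>(terms));
        }
        return common;
    }

    public static void main(String[] args) {
        // Hypothetical term lists, assuming an analyzer that lowercases
        // tokens and does not remove stop words.
        List<List<String>> docs = List.of(
            List.of("8", "apple", "got", "i", "iphone", "new", "the"),
            List.of("8", "apple", "have", "iphone", "new", "seen", "the", "you"),
            List.of("8", "apple", "iphone", "is", "out", "the")
        );
        System.out.println(commonTerms(docs)); // prints [8, apple, iphone, the]
    }
}
```

A document like doc4 from the question would empty the intersection, so to get groups such as {doc1, doc2, doc3} you would still have to decide which documents to intersect, for example by trying subsets or by dropping documents that share too few terms.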
