Reputation: 501
doc1 = "I got the new Apple iPhone 8";
doc2 = "have you seen the new Apple iPhone 8?";
doc3 = "the Apple iPhone 8 is out";
doc4 = "another doc without the common words";
find_commons(["doc1", "doc2", "doc3", "doc4"]);
results: {{"doc1", "doc2", "doc3"}, {"Apple", "iPhone"}}
or something similar.
A related question: is there a better library or system to achieve this using Lucene's data?
Upvotes: 1
Views: 405
Reputation: 4184
Yes, you can use term vectors to retrieve this information.
First, you need to make sure that the TermVectors are stored in the index, e.g.:
private static Document createDocument(String title, String content) {
    Document doc = new Document();
    doc.add(new StringField("title", title, Field.Store.YES));

    // Custom field type: tokenized and indexed, with term vectors stored.
    FieldType type = new FieldType();
    type.setTokenized(true);
    type.setStoreTermVectors(true);
    type.setStored(false); // the raw content itself is not stored
    type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
    doc.add(new Field("content", content, type));
    return doc;
}
Then, you can retrieve the term vector for a given document id:
private static List<String> getTermsForDoc(int docId, String field, IndexReader reader) throws IOException {
    List<String> result = new ArrayList<>();
    Terms terms = reader.getTermVector(docId, field);
    if (terms == null) {
        // No term vector was stored for this document and field.
        return result;
    }
    TermsEnum it = terms.iterator();
    for (BytesRef br = it.next(); br != null; br = it.next()) {
        result.add(br.utf8ToString());
    }
    return result;
}
Finally you can retrieve common terms for two documents:
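private static List<String> getCommonTerms(int docId1, int docId2, IndexSearcher searcher) throws IOException {
    // Using the field "content" is just an example here.
    // getTermsForDoc expects an IndexReader, so take it from the searcher.
    IndexReader reader = searcher.getIndexReader();
    List<String> termList1 = getTermsForDoc(docId1, "content", reader);
    List<String> termList2 = getTermsForDoc(docId2, "content", reader);
    // Keep only the terms that also occur in the second document.
    termList1.retainAll(termList2);
    return termList1;
}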
Of course this can easily be expanded to allow an arbitrary number of documents.
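As a rough sketch of that expansion, the intersection step could look something like the following. It is plain Java with no Lucene dependency: the class name CommonTerms and the method commonTerms are hypothetical, and the hard-coded term lists are hand-written stand-ins for what getTermsForDoc might return with a lowercasing analyzer and no stop-word removal (an assumption; your analyzer may produce different terms).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
final class CommonTerms {

    // Intersects any number of per-document term lists.
    // In the Lucene setting, each list would come from getTermsForDoc(docId, field, reader).
    static Set<String> commonTerms(List<List<String>> termLists) {
        if (termLists.isEmpty()) {
            return new LinkedHashSet<>();
        }
        // Start from the first document's terms, preserving their order.
        Set<String> common = new LinkedHashSet<>(termLists.get(0));
        for (List<String> terms : termLists.subList(1, termLists.size())) {
            // A HashSet makes each membership check O(1) during retainAll.
            common.retainAll(new HashSet<>(terms));
            if (common.isEmpty()) {
                break; // nothing left to intersect
            }
        }
        return common;
    }

    public static void main(String[] args) {
        // Assumed analyzer output for doc1, doc2 and doc3 from the question.
        List<List<String>> termLists = Arrays.asList(
            Arrays.asList("8", "apple", "got", "i", "iphone", "new", "the"),
            Arrays.asList("8", "apple", "have", "iphone", "new", "seen", "the", "you"),
            Arrays.asList("8", "apple", "iphone", "is", "out", "the"));
        System.out.println(commonTerms(termLists)); // prints [8, apple, iphone, the]
    }
}
```

Note that this only intersects the documents you pass in. If you add a document like doc4, the common terms shrink to whatever it shares with the others, so finding which group of documents shares terms (as in the question's expected result) would need an extra step on top of this.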
Upvotes: 1