Reputation: 317
I have an index of news articles where I store the title, link, and description of each item. Sometimes the same story, from the same link, is published under different titles by different news sources. I don't want articles with exactly the same description to be added twice. How can I check whether a document already exists in the index?
Upvotes: 4
Views: 4052
Reputation: 1307
I'm assuming you're working with Java. Assuming your link is being saved in the index as a StringField (so any analyzers you use won't break up the link into multiple terms), you can use a TermQuery.
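import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
TopDocs results = searcher.search(new TermQuery(new Term("link", "http://example.com")), 1);
// On newer Lucene versions totalHits is an object (e.g. results.totalHits.value in 8.x).
if (results.totalHits == 0) {
    Document doc = new Document();
    // create your document here with your fields
    // link field should be stored as a StringField
    doc.add(new StringField("link", "http://example.com", Field.Store.YES));
    writer.addDocument(doc);
}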
Note that a StringField is indexed verbatim as a single term, so matching is case-sensitive; you may want to convert the value to lowercase both when indexing and when searching.
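As a minimal sketch of that lowercasing step, assuming the same helper is applied on both the indexing and the searching side: `LinkKey` and `normalize` are hypothetical names used only for illustration and are not part of Lucene.

```java
import java.util.Locale;

// Hypothetical helper (not a Lucene class): builds the exact-match key for the "link" field.
final class LinkKey {
    private LinkKey() {}

    // Trim surrounding whitespace and lowercase with a fixed locale,
    // so the same link always produces the same term regardless of the JVM's default locale.
    static String normalize(String link) {
        return link.trim().toLowerCase(Locale.ROOT);
    }
}
```

You would then pass `LinkKey.normalize(url)` both to the `Term` you search for and to the `StringField` you index. Keep in mind that lowercasing the whole URL also treats paths that differ only by case as the same link, which may or may not be what you want.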
If you want to check whether an existing document already matches on either field (the link or the description), you can combine both checks in a BooleanQuery using the Occur.SHOULD condition:
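import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
// Lucene 5.3 and later build this query with BooleanQuery.Builder instead of the constructor.
BooleanQuery matchingQuery = new BooleanQuery();
matchingQuery.add(new TermQuery(new Term("link", "http://example.com")), Occur.SHOULD);
matchingQuery.add(new TermQuery(new Term("description", "the unique description of the article")), Occur.SHOULD);
TopDocs results = searcher.search(matchingQuery, 1);
if (results.totalHits == 0) {
    Document doc = new Document();
    // create your document here with your fields
    // link field should be stored as a StringField
    doc.add(new StringField("link", "http://example.com", Field.Store.YES));
    doc.add(new StringField("description", "the unique description of the article", Field.Store.YES));
    // note if you need the description to be tokenized, you need to add another TextField to the document with a different field name
    doc.add(new TextField("descriptionText", "the unique description of the article", Field.Store.NO));
    writer.addDocument(doc);
}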
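Indexing the full description as one StringField term works for short texts, but very long descriptions could run into Lucene's maximum term length. One possible variation, not something the code above does, is to index a fixed-length digest of the description and run the exact-match check against that instead. The sketch below uses only the JDK; `DescriptionKey`, `sha256Hex`, and the `descriptionHash` field name mentioned afterwards are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper (not a Lucene class): turns a description into a fixed-length exact-match key.
final class DescriptionKey {
    private DescriptionKey() {}

    // Returns the SHA-256 digest of the UTF-8 bytes of the description as 64 lowercase hex characters.
    static String sha256Hex(String description) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(description.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to an unsigned value so negative bytes still format as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen in practice.
            throw new IllegalStateException(e);
        }
    }
}
```

With this in place you could add something like `new StringField("descriptionHash", DescriptionKey.sha256Hex(description), Field.Store.NO)` to the document and search with a TermQuery on the same field and the same digest. Two descriptions only match if they are byte-for-byte identical, which is the "exactly same description" case from the question.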
Upvotes: 4