Reputation: 93
I am using Lucene 6.6.0 and indexing my data with StandardAnalyzer.
The values I index are the ones shown in the sample code below.
After indexing, a search for a&e returns no results. This is my sample code:
Directory dir = new RAMDirectory();
IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
IndexWriter writer = new IndexWriter(dir, iwc);
Document doc = new Document();
doc.add(new TextField("text", "a&e networks", Field.Store.YES));
writer.addDocument(doc);
doc = new Document();
doc.add(new TextField("text", "a&e", Field.Store.YES));
writer.addDocument(doc);
writer.close();
IndexReader reader = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
Query query = new TermQuery(new Term("text", "a&e"));
TopDocs results = searcher.search(query, 5);
final ScoreDoc[] scoreDocs = results.scoreDocs;
for (ScoreDoc scoreDoc : scoreDocs) {
    System.out.println(scoreDoc.doc + " " + scoreDoc.score + " " + searcher.doc(scoreDoc.doc).get("text"));
}
System.out.println("Hits: " + results.totalHits);
System.out.println("Max score:" + results.getMaxScore());
The output is Hits: 0 Max score:NaN
A search for just a returns no results either.
However, if I pass a stop word set to StandardAnalyzer like this:
List<String> stopWords = Arrays.asList("&");
CharArraySet stopSet = new CharArraySet(stopWords, false);
IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer(stopSet));
With that change, a search for a returns results, but a search for a&e still returns nothing.
Please suggest how to achieve this. My goal is that a search for a&e returns results. Do I need a custom analyzer? If so, what should it contain?
Upvotes: 1
Views: 220
Reputation: 1448
The & character is probably treated as a word boundary:
https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/analysis/standard/StandardTokenizer.html
This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
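a is probably treated as a stop word, so it is removed at index time. The default English stop word set of StandardAnalyzer in 6.6.0 contains a but not e, so e should still be indexed as its own term. Either way, no single term a&e ends up in the index.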
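One way to check this rather than guess is to print the tokens the analyzer actually produces. This is a minimal sketch, assuming Lucene 6.6.0 (lucene-core) is on the classpath; the class name TokenDump and the helper tokens are made up for illustration and are not part of Lucene.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical helper class, not part of Lucene.
public class TokenDump {

    // Runs the text through the analyzer and collects the emitted terms.
    static List<String> tokens(Analyzer analyzer, String text) throws IOException {
        List<String> out = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream("text", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                      // required before the first incrementToken()
            while (ts.incrementToken()) {
                out.add(term.toString());
            }
            ts.end();
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        try (Analyzer analyzer = new StandardAnalyzer()) {
            // Shows which terms are really written to the index for this value.
            System.out.println(tokens(analyzer, "a&e networks"));
        }
    }
}
```

If the list does not contain a&e, a TermQuery for a&e cannot match, because TermQuery looks up the exact term and does not run the analyzer on its input.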
You can try some randomly generated keywords separated by the & character (e.g. adsadaerewfds&eqeqwedasd). After indexing, try searching for the keywords before and after the &. If those keywords are found, either store the values without analyzing them (you can use StringField) or create a custom analyzer.
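As a rough sketch of the analyzer route: an analyzer that splits only on whitespace keeps a&e as one term, so the TermQuery from the question can match. This assumes Lucene 6.6.0 with lucene-analyzers-common on the classpath for WhitespaceAnalyzer; the class name WhitespaceSearchSketch and the helper countHits are made up for illustration.

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

// Hypothetical helper class, not part of Lucene.
public class WhitespaceSearchSketch {

    // Indexes the given values into the "text" field with WhitespaceAnalyzer,
    // then returns how many of the top 5 results match the exact term.
    static int countHits(String termText, String... values) throws IOException {
        Analyzer analyzer = new WhitespaceAnalyzer(); // splits on whitespace only, no stop words
        Directory dir = new RAMDirectory();           // in-memory index, as in the question
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            for (String value : values) {
                Document doc = new Document();
                doc.add(new TextField("text", value, Field.Store.YES));
                writer.addDocument(doc);
            }
        }
        try (IndexReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs results = searcher.search(new TermQuery(new Term("text", termText)), 5);
            return results.scoreDocs.length;
        }
    }

    public static void main(String[] args) throws IOException {
        // Both documents contain the whitespace-delimited term a&e.
        System.out.println("Hits: " + countHits("a&e", "a&e networks", "a&e")); // prints "Hits: 2"
    }
}
```

Note that WhitespaceAnalyzer does not lowercase, so A&E and a&e would be different terms. StringField is the other option mentioned above, but it indexes the whole value as a single term, so a query for a&e would match the document a&e and not a&e networks.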
Upvotes: 1