Mahesh

Reputation: 93

Lucene does not return results when I search with special characters

I am using Lucene 6.6.0 and indexing my data with StandardAnalyzer.

I am indexing the following values:

  1. a&e networks
  2. a&e

After indexing, a search for a&e returns no results. This is my sample code:

    Directory dir = new RAMDirectory();
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    IndexWriter writer = new IndexWriter(dir, iwc);

    Document doc = new Document();
    doc.add(new TextField("text", "a&e networks", Field.Store.YES));
    writer.addDocument(doc);
    doc = new Document();
    doc.add(new TextField("text", "a&e", Field.Store.YES));
    writer.addDocument(doc);
    writer.close();

    IndexReader reader = DirectoryReader.open(dir);

    IndexSearcher searcher = new IndexSearcher(reader);

    Query query = new TermQuery(new Term("text", "a&e"));

    TopDocs results = searcher.search(query, 5);
    final ScoreDoc[] scoreDocs = results.scoreDocs;
    for (ScoreDoc scoreDoc : scoreDocs) {
        System.out.println(scoreDoc.doc + " " + scoreDoc.score + " " + searcher.doc(scoreDoc.doc).get("text"));
    }
    System.out.println("Hits: " + results.totalHits);
    System.out.println("Max score:" + results.getMaxScore());

The output I get is Hits: 0 and Max score:NaN.

Even a search for a alone returns no results in this case.

However, if I pass a stop word set to StandardAnalyzer like this:

    List<String> stopWords = Arrays.asList("&");
    CharArraySet stopSet = new CharArraySet(stopWords, false);
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer(stopSet));

then a search for a does return results. But even then, a search for a&e still returns nothing.

Please suggest how to achieve this. My goal is that a search for a&e returns results. Do I need a custom analyzer? If so, please explain what it should contain.

Upvotes: 1

Views: 220

Answers (1)

hkn

Reputation: 1448

The & character is probably treated as a word boundary:

https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/analysis/standard/StandardTokenizer.html

This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.

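So a&e is split into a and e at index time. With the default StandardAnalyzer in 6.6.0, a is an English stop word and is removed, which matches what you see: a is only found once you replace the stop word set. As far as I know e is not in the default set, so it should be indexed as a term of its own. Either way, no term a&e is ever written to the index, and because a TermQuery does not analyze its text, it looks for the literal term a&e and finds nothing.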
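To make the effect concrete, here is a toy model in plain Java. It is not Lucene code and does not implement the real UAX #29 rules: it simply splits at anything that is not a letter or digit, lowercases, and drops stop words. The class name ToyStandardAnalysis and the shortened stop word list are my own assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model only, NOT Lucene code. It approximates the default chain:
// break at non-alphanumeric characters, lowercase, drop stop words.
class ToyStandardAnalysis {

    // Assumption: a short excerpt of the default English stop word set.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to"));

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Simplification: every run of non-letter, non-digit characters is a break.
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            String term = token.toLowerCase(Locale.ROOT);
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("a&e networks")); // [e, networks]
        System.out.println(analyze("a&e"));          // [e]
    }
}
```

In this model neither document produces the term a&e, and a is dropped, so an exact lookup for a&e or for a has nothing to match.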

You can verify this with some randomly generated keywords separated by the & character (e.g. adsadaerewfds&eqeqwedasd). After indexing, search for the keywords before and after the &. If those keywords are found on their own, the text is being split at &, so either store the values without analysis (you can use StringField) or create a custom analyzer.
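The two options are not equivalent, and a small sketch may help to see why. Again this is a toy inverted index in plain Java, not Lucene code; ToyTermIndex and its methods are hypothetical names. The keyword strategy keeps the whole value as one term, roughly what an un-analyzed StringField stores. The whitespace strategy splits on spaces only, which is one thing a custom analyzer could do (in Lucene, WhitespaceAnalyzer behaves along these lines; that is my suggestion, so check it against your requirements).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Toy inverted index, NOT Lucene code. It only shows which exact terms
// exist under each strategy, which is what an exact term lookup depends on.
class ToyTermIndex {

    private final Map<String, List<Integer>> postings = new HashMap<>();
    private final Function<String, List<String>> tokenizer;
    private int nextDocId = 0;

    ToyTermIndex(Function<String, List<String>> tokenizer) {
        this.tokenizer = tokenizer;
    }

    // The whole value becomes a single term.
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    // Split on whitespace only, so '&' stays inside the term.
    static List<String> whitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    void add(String text) {
        int docId = nextDocId++;
        for (String term : tokenizer.apply(text)) {
            postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
        }
    }

    // Exact term lookup: the query text is used as-is, without analysis.
    List<Integer> lookup(String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }

    static ToyTermIndex build(Function<String, List<String>> tokenizer, String... docs) {
        ToyTermIndex index = new ToyTermIndex(tokenizer);
        for (String doc : docs) {
            index.add(doc);
        }
        return index;
    }

    public static void main(String[] args) {
        ToyTermIndex byKeyword = build(ToyTermIndex::keyword, "a&e networks", "a&e");
        ToyTermIndex byWhitespace = build(ToyTermIndex::whitespace, "a&e networks", "a&e");
        System.out.println("keyword:    " + byKeyword.lookup("a&e"));    // [1]
        System.out.println("whitespace: " + byWhitespace.lookup("a&e")); // [0, 1]
    }
}
```

With the keyword strategy only the document whose entire value is a&e matches, so "a&e networks" is missed. With whitespace splitting both documents match, but terms are then case-sensitive and keep any attached punctuation unless the analyzer also normalizes them.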

Upvotes: 1
