Why is Lucene not working for an exact string in Java?

Question

I am working with Lucene in Java. We have 100K stop names in a MySQL database. The stop names look like:

NEW YORK PENN STATION, 
NEWARK PENN STATION,
NEWARK BROAD ST,
NEW PROVIDENCE
etc

When the user enters a search input such as NEW YORK, the results include the NEW YORK PENN STATION stop, but when the user enters the exact name NEW YORK PENN STATION, the search returns zero results.

My code is:

public ArrayList<String> getSimilarString(ArrayList<String> source, String querystr)
  {
      ArrayList<String> arResult = new ArrayList<String>();

        try
        {
            // 0. Specify the analyzer for tokenizing text.
            //    The same analyzer should be used for indexing and searching
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);

            // 1. create the index
            Directory index = new RAMDirectory();

            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);

            IndexWriter w = new IndexWriter(index, config);

            for(int i = 0; i < source.size(); i++)
            {
                addDoc(w, source.get(i), "1933988" + (i + 1) + "z");
            }

            w.close();

            // 2. query
            // the "title" arg specifies the default field to use
            // when no field is explicitly specified in the query.
            Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr + "*");

            // 3. search
            int hitsPerPage = 20;
            IndexReader reader = DirectoryReader.open(index);
            IndexSearcher searcher = new IndexSearcher(reader);
            TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
            searcher.search(q, collector);
            ScoreDoc[] hits = collector.topDocs().scoreDocs;

            // 4. Get results
            for(int i = 0; i < hits.length; ++i) 
            {
                  int docId = hits[i].doc;
                  Document d = searcher.doc(docId);
                  arResult.add(d.get("title"));
            }

            // reader can only be closed when there
            // is no need to access the documents any more.
            reader.close();

        }
        catch(Exception e)
        {
            System.out.println("Exception (LuceneAlgo.getSimilarString()) : " + e);
        }

        return arResult;

  }

  private static void addDoc(IndexWriter w, String title, String isbn) throws IOException 
  {
        Document doc = new Document();
        doc.add(new TextField("title", title, Field.Store.YES));

        // use a string field for isbn because we don't want it tokenized
        doc.add(new StringField("isbn", isbn, Field.Store.YES));
        w.addDocument(doc);
  }

In this code, source is the list of stop names and querystr is the user's search input.

Does Lucene work on large strings?

Why is Lucene not working on an exact string?

phanin · Accepted Answer

Instead of

1) Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr + "*");

Ex: with querystr "new york station", the string "new york station*" will be parsed to "title:new title:york title:station*". This query returns every doc containing any of these terms.

Try this:

2) Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse("+(" + querystr + ")");

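Ex1: "new york" will be parsed to "+(title:new title:york)"

The '+' marks the parenthesised group as a 'must' occurrence in the result document. The terms inside the group are still combined with the parser's default OR operator, so a document satisfies the group when it contains at least one of them. This matches the docs containing "new york" and "new york station", and also any other stop that contains "new" or "york".

Ex2: "new york station" will be parsed to "+(title:new title:york title:station)". For the same reason, this does not by itself require all three terms. To match only docs that contain every term, mark each term as required ("+new +york +station") or set the parser's default operator to AND.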
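To make the difference between "any term" (OR) and "every term" (AND) matching concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: the class BooleanMatchSketch and its methods are hypothetical names, and lowercasing plus splitting on whitespace only approximates what StandardAnalyzer does.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Toy model of term matching; this is NOT Lucene code.
class BooleanMatchSketch {

    // Rough stand-in for an analyzer: lowercase, then split on whitespace.
    static Set<String> terms(String text) {
        String[] parts = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        return new HashSet<String>(Arrays.asList(parts));
    }

    // OR semantics: the document matches if it contains at least one query term.
    static boolean matchesAny(String doc, String query) {
        Set<String> docTerms = terms(doc);
        for (String t : terms(query)) {
            if (docTerms.contains(t)) {
                return true;
            }
        }
        return false;
    }

    // AND semantics: the document matches only if it contains every query term.
    static boolean matchesAll(String doc, String query) {
        return terms(doc).containsAll(terms(query));
    }
}
```

With the stop names from the question, matchesAny("NEW PROVIDENCE", "new york station") is true because the two share the term "new", while matchesAll("NEW PROVIDENCE", "new york station") is false.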

Please make sure that the field name 'title' is what you're looking for.

Your questions:

Does Lucene work on large strings?

You've got to define what a large string is. Are you actually looking for phrase search? In general, yes, Lucene works for large strings.

Why is Lucene not working on an exact string?

Because parsing (querystr + "*") generates individual term queries connected by the OR operator. Ex: 'new york*' will be parsed to "title:new OR title:york*".

If you want to find "new york station", the wildcard query above is not what you need. The StandardAnalyzer you passed in tokenizes "new york station" at index time, i.e. breaks it down into 3 separate terms.

So the query "york*" finds "york station" only because the doc contains the term "york", not because the wildcard reaches "station". "york" has no idea of "station": they are different terms, i.e. different entries in the index, and a wildcard only expands within a single term.
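The point that each word becomes a separate index entry can be sketched with a toy inverted index in plain Java. This is not how Lucene is implemented; TermIndexSketch, index and prefixLookup are hypothetical names, and lowercasing plus whitespace splitting only roughly imitates StandardAnalyzer.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index; this is NOT Lucene code.
class TermIndexSketch {

    // Maps each term to the positions (in the input list) of the stop names containing it.
    static Map<String, Set<Integer>> index(List<String> stops) {
        Map<String, Set<Integer>> idx = new TreeMap<String, Set<Integer>>();
        for (int id = 0; id < stops.size(); id++) {
            String[] terms = stops.get(id).toLowerCase(Locale.ROOT).trim().split("\\s+");
            for (String term : terms) {
                Set<Integer> ids = idx.get(term);
                if (ids == null) {
                    ids = new TreeSet<Integer>();
                    idx.put(term, ids);
                }
                ids.add(id);
            }
        }
        return idx;
    }

    // Prefix lookup: the prefix is compared against single terms only,
    // so it can never span two words of the original stop name.
    static Set<Integer> prefixLookup(Map<String, Set<Integer>> idx, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        Set<Integer> hits = new TreeSet<Integer>();
        for (Map.Entry<String, Set<Integer>> e : idx.entrySet()) {
            if (e.getKey().startsWith(p)) {
                hits.addAll(e.getValue());
            }
        }
        return hits;
    }
}
```

Indexing the four stop names from the question, prefixLookup(idx, "york") finds only NEW YORK PENN STATION, while prefixLookup(idx, "york station") finds nothing, because no single term starts with "york station".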

What you actually need for finding an exact string is a PhraseQuery, for which the query string should be "new york" WITH the quotes.
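A minimal sketch of how the quoted query string could be built from raw user input. The helper name toPhraseQueryString is hypothetical, and the escaping of backslashes and double quotes is an assumption based on the classic query parser syntax, where a backslash escapes special characters; check it against the Lucene version you use.

```java
// Hypothetical helper: builds the query string for a phrase search.
class PhraseQuerySketch {

    // Wraps the user's input in double quotes so that the parser treats it
    // as one phrase instead of a list of independent terms.
    static String toPhraseQueryString(String userInput) {
        String escaped = userInput.trim()
                .replace("\\", "\\\\")   // escape backslashes first
                .replace("\"", "\\\"");  // then escape embedded double quotes
        return "\"" + escaped + "\"";
    }
}
```

In the question's code, the result would be passed to parse(...) in place of querystr + "*". Note that a phrase query matches docs that contain the words in that order, so "new york" still matches NEW YORK PENN STATION.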
