Reputation: 1073

How to get records starting with a specific letter in Lucene

I have a cached list of names that I store in a Lucene index. I want to get the people whose name starts with a specific letter.

For example, my list is below. I store each entry in the name field.

foo bar
blabla foo
foo2 bar
test data

When I search with name:f*, it returns foo bar, foo2 bar and blabla foo. It checks every word in the field, so it matches blabla foo too. But I need only the names that start with f, where the very first letter of the field is f, not records that merely contain a word starting with f somewhere later in the name.

Any ideas?

Upvotes: 0

Answers (3)

jrey

Reputation: 2183

Wildcard Searches

Lucene supports single and multiple character wildcard searches within single terms (not within phrase queries).

To perform a single character wildcard search use the "?" symbol.

To perform a multiple character wildcard search use the "*" symbol.

The single character wildcard search looks for terms that match that with the single character replaced. For example, to search for "text" or "test" you can use the search:

te?t

Multiple character wildcard searches look for 0 or more characters. For example, to search for test, tests or tester, you can use the search:

test*
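
To make the two wildcard rules above concrete, here is a minimal plain-Java sketch that imitates them on a list of single terms. It is only an illustration of the semantics, not Lucene's implementation; the class name WildcardSketch and its methods are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustration only: imitates wildcard matching on single terms.
// WildcardSketch, toRegex and matchTerms are hypothetical names, not Lucene API.
class WildcardSketch {

    // '?' matches exactly one character, '*' matches zero or more characters;
    // every other character is matched literally.
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                sb.append('.');
            } else if (c == '*') {
                sb.append(".*");
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    // Return the terms that match the wildcard pattern as a whole.
    static List<String> matchTerms(String wildcard, List<String> terms) {
        Pattern p = toRegex(wildcard);
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            if (p.matcher(term).matches()) {
                out.add(term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("text", "test", "tests", "tester", "toast");
        System.out.println(matchTerms("te?t", terms));  // prints [text, test]
        System.out.println(matchTerms("test*", terms)); // prints [test, tests, tester]
    }
}
```

Note that the pattern is applied to each term separately, which is why a wildcard on an analyzed field matches a word anywhere in the field.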

Example with a regex (the field name "name" is taken from the question):

RegexQuery query = new RegexQuery(new Term("name", "^a.*$"));
query.setRegexImplementation(new JavaUtilRegexCapabilities());

return searcher.search(query, null, 1000).totalHits;

http://lucene.apache.org/core/4_3_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description

Example code:

        BasicConfigurator.configure();

        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

        // Store the index in memory:
        Directory directory = new RAMDirectory();
        // To store an index on disk, use this instead:
        // Directory directory = FSDirectory.open(new
        // File("./lucene/data"));
        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_CURRENT, analyzer);
        IndexWriter iwriter;

        iwriter = new IndexWriter(directory, config);

        String[] words = { "Olimpia", "Cerro", "Olimpo", "Libertad",
                "Nacional", "Sol", "O'higgins", "Sao Paulo",
                "Oriente Petrolero", "Barrio Obrero", "B. Obrero" };

        for (String word : words) {
            Document doc = new Document();
            String text = word;
            // NOT_ANALYZED indexes the whole name as a single term, so the
            // query below is matched against the full value, not each word.
            doc.add(new Field("name", text, Field.Store.YES,
                    Field.Index.NOT_ANALYZED));

            iwriter.addDocument(doc);
        }

        iwriter.close();

        // Now search the index:

        logger.info("HelloLucene.main: query2 -----------");

        DirectoryReader ireader2 = DirectoryReader.open(directory);
        IndexSearcher isearcher2 = new IndexSearcher(ireader2);

        RegexQuery query2 = new RegexQuery(new Term("name", "O.*"));
        query2.setRegexImplementation(new JavaUtilRegexCapabilities(
                JavaUtilRegexCapabilities.FLAG_CASE_INSENSITIVE));

        ScoreDoc[] hits2 = isearcher2.search(query2, null, 1000).scoreDocs;
        for (int i = 0; i < hits2.length; i++) {
            Document hitDoc = isearcher2.doc(hits2[i].doc);
            logger.info("HelloLucene.main: starting with O = "
                    + hitDoc.get("name"));

        }

Upvotes: 1

Jayendra

Reputation: 52809

I would suggest indexing the field without tokenization.
Also, instead of a wildcard search, use the EdgeNGramTokenFilter: it produces the prefix tokens at index time, so searching is much faster than with wildcard queries.
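
To show why edge n-grams avoid the wildcard, here is a rough plain-Java sketch of the idea. It does not use Lucene or EdgeNGramTokenFilter itself; EdgeNGramSketch, its methods and the gram sizes 1 to 5 are assumptions chosen for illustration. Each untokenized name is expanded into its leading prefixes when it is indexed, so a "starts with" search becomes an exact lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustration only: a hand-rolled stand-in for edge n-gram indexing.
// EdgeNGramSketch and its methods are hypothetical names, not Lucene API.
class EdgeNGramSketch {

    // Leading-edge n-grams of the whole, untokenized value,
    // e.g. "foo bar" with sizes 1..3 gives "f", "fo", "foo".
    static List<String> edgeNGrams(String value, int minGram, int maxGram) {
        String v = value.toLowerCase(Locale.ROOT);
        List<String> grams = new ArrayList<>();
        for (int n = minGram; n <= Math.min(maxGram, v.length()); n++) {
            grams.add(v.substring(0, n));
        }
        return grams;
    }

    // "Index time": map every gram to the names that begin with it.
    static Map<String, List<String>> buildIndex(List<String> names, int minGram, int maxGram) {
        Map<String, List<String>> index = new HashMap<>();
        for (String name : names) {
            for (String gram : edgeNGrams(name, minGram, maxGram)) {
                index.computeIfAbsent(gram, k -> new ArrayList<>()).add(name);
            }
        }
        return index;
    }

    // "Query time": an exact lookup of the prefix, no wildcard expansion.
    static List<String> startsWith(Map<String, List<String>> index, String prefix) {
        return index.getOrDefault(prefix.toLowerCase(Locale.ROOT), Collections.emptyList());
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("foo bar", "blabla foo", "foo2 bar", "test data");
        Map<String, List<String>> index = buildIndex(names, 1, 5);
        System.out.println(startsWith(index, "f")); // prints [foo bar, foo2 bar]
    }
}
```

The trade-off is a larger index in exchange for cheaper queries, and prefixes longer than the maximum gram size are not found.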

Upvotes: 1

femtoRgon

Reputation: 33351

That's how Lucene operates by default. It tokenizes fields into terms, and you search for terms that occur anywhere in the field. For large text documents, this makes perfect sense, as you would rarely want to search only from the beginning of a large body of text.

If you want to be able to search the value as a literal string, rather than as a tokenized set of terms, the best solution is to index it in a way that supports that well. In Solr, solr.StrField is the typical field type for that, rather than TextField; in plain Lucene, the counterpart is StringField, which indexes the whole value as a single term.
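
The difference between the two ways of indexing can be imitated in a few lines of plain Java. This is a simplified simulation using the names from the question, not Lucene code: the whitespace split stands in for an analyzer, and TokenizedVsLiteralSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: simulates prefix search on a tokenized field versus a
// field kept as one literal string. Names here are hypothetical, not Lucene API.
class TokenizedVsLiteralSketch {

    // Tokenized field: a name matches if ANY of its terms starts with the prefix.
    static List<String> tokenizedPrefix(List<String> names, String prefix) {
        List<String> out = new ArrayList<>();
        for (String name : names) {
            for (String term : name.split("\\s+")) {
                if (term.startsWith(prefix)) {
                    out.add(name);
                    break; // one matching term is enough
                }
            }
        }
        return out;
    }

    // Literal field: the whole value is one term, so only its start is checked.
    static List<String> literalPrefix(List<String> names, String prefix) {
        List<String> out = new ArrayList<>();
        for (String name : names) {
            if (name.startsWith(prefix)) {
                out.add(name);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("foo bar", "blabla foo", "foo2 bar", "test data");
        System.out.println(tokenizedPrefix(names, "f")); // prints [foo bar, blabla foo, foo2 bar]
        System.out.println(literalPrefix(names, "f"));   // prints [foo bar, foo2 bar]
    }
}
```

The first result reproduces what the question describes for name:f*, and the second is the behavior the asker wants.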

Upvotes: 0
