Reputation: 1073
I have a cached list of names that I store in a Lucene index. I want to get the people whose name starts with a specific letter.
For example, my list is below. I store each entry in the name field.
foo bar
blabla foo
foo2 bar
test data
When I search with name:f* it returns foo bar, foo2 bar and blabla foo. It checks every word in the field, so it matches blabla foo too. But I need only the names that start with f, that is, names whose very first letter is f, not records that contain a word starting with f somewhere later in the field.
Any ideas?
Upvotes: 0
Views: 3684
Reputation: 2183
Wildcard Searches
Lucene supports single and multiple character wildcard searches within single terms (not within phrase queries).
To perform a single character wildcard search use the "?" symbol.
To perform a multiple character wildcard search use the "*" symbol.
The single character wildcard search looks for terms that match that with the single character replaced. For example, to search for "text" or "test" you can use the search:
te?t
Multiple character wildcard searches looks for 0 or more characters. For example, to search for test, tests or tester, you can use the search:
test*
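To make the two wildcard symbols concrete, here is a minimal plain-Java sketch of their semantics. It is not Lucene's implementation: the class WildcardSketch and its methods are hypothetical names, and the pattern is simply translated to java.util.regex. Keep in mind that Lucene applies a wildcard to each indexed term, so on a tokenized field f* is tested against every word, not against the whole value.

```java
import java.util.regex.Pattern;

public class WildcardSketch {
    // Translate a wildcard pattern into a regex:
    // '?' matches exactly one character, '*' matches zero or more characters.
    // Every other character is quoted so it is matched literally.
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                sb.append('.');
            } else if (c == '*') {
                sb.append(".*");
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    // True if the whole term matches the wildcard pattern.
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("te?t", "text"));    // true
        System.out.println(matches("te?t", "tet"));     // false: '?' needs one character
        System.out.println(matches("test*", "tester")); // true
        System.out.println(matches("test*", "tes"));    // false
    }
}
```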
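Example with a regex:
// The regex is matched against individual indexed terms, so the "name" field
// must be indexed untokenized for this to test the start of the whole name.
RegexQuery query = new RegexQuery(new Term("name", "^a.*$"));
query.setRegexImplementation(new JavaUtilRegexCapabilities());
return searcher.search(query, null, 1000).totalHits;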
Example code:
BasicConfigurator.configure();
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
// Store the index in memory:
Directory directory = new RAMDirectory();
// To store an index on disk, use this instead:
// Directory directory = FSDirectory.open(new
// File("./lucene/data"));
IndexWriterConfig config = new IndexWriterConfig(
Version.LUCENE_CURRENT, analyzer);
IndexWriter iwriter;
iwriter = new IndexWriter(directory, config);
String[] words = { "Olimpia", "Cerro", "Olimpo", "Libertad",
"Nacional", "Sol", "O'higgins", "Sao Paulo",
"Oriente Petrolero", "Barrio Obrero", "B. Obrero" };
for (String word : words) {
Document doc = new Document();
String text = word;
doc.add(new Field("name", text, Field.Store.YES,
Field.Index.NOT_ANALYZED));
// ,Field.Store.NO, Field.Index.NOT_ANALYZED
iwriter.addDocument(doc);
}
iwriter.close();
// Now search the index:
DirectoryReader ireader2 = DirectoryReader.open(directory);
IndexSearcher isearcher2 = new IndexSearcher(ireader2);
logger.info("HelloLucene.main: query2 -----------");
// The field is NOT_ANALYZED, so each name is a single term and the
// regex is tested against the whole value, not against each word.
RegexQuery query2 = new RegexQuery(new Term("name", "O.*"));
query2.setRegexImplementation(new JavaUtilRegexCapabilities(
JavaUtilRegexCapabilities.FLAG_CASE_INSENSITIVE));
ScoreDoc[] hits2 = isearcher2.search(query2, null, 1000).scoreDocs;
for (int i = 0; i < hits2.length; i++) {
Document hitDoc = isearcher2.doc(hits2[i].doc);
logger.info("HelloLucene.main: starting with O = "
+ hitDoc.get("name"));
}
Upvotes: 1
Reputation: 52809
I would suggest indexing the field without tokenization.
Also, instead of a wildcard search, consider the EdgeNGramTokenFilter. It produces the prefix tokens at index time, so a prefix lookup becomes a plain term match, which should be much faster than a wildcard search.
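As a rough illustration of what edge n-grams are, here is a plain-Java sketch that generates them for an untokenized value. It is not the Lucene filter itself: EdgeNGramSketch and edgeNGrams are hypothetical names, and the minGram/maxGram parameters only mirror the idea of the filter's gram-size bounds.

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeNGramSketch {
    // Return the prefixes of value with lengths from minGram to maxGram
    // (capped at the length of the value), shortest first.
    static List<String> edgeNGrams(String value, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int upper = Math.min(maxGram, value.length());
        for (int len = minGram; len <= upper; len++) {
            grams.add(value.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(edgeNGrams("foo bar", 1, 3));    // [f, fo, foo]
        System.out.println(edgeNGrams("blabla foo", 1, 3)); // [b, bl, bla]
    }
}
```

With the grams indexed as terms, an exact term lookup for f finds foo bar but not blabla foo, because only prefixes of the whole value were indexed.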
Upvotes: 1
Reputation: 33351
That's how Lucene operates by default. It tokenizes fields into terms, and you search for terms that occur anywhere in the field. For large text documents this makes sense, as you would rarely want to search only from the beginning of a large body of text.
If you want to search the value as a literal string rather than as a tokenized set of terms, the best solution is to index it in a way that supports that well. In Solr, solr.StrField is the typical field type for that, rather than TextField. In plain Lucene, an untokenized field such as StringField plays the same role.
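The difference can be simulated in plain Java with the names from the question. This is only a sketch of the two matching behaviours, not Lucene code: PrefixMatchSketch and both methods are hypothetical, and splitting on whitespace plus lowercasing merely stands in for an analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class PrefixMatchSketch {
    // Tokenized behaviour: a name matches if ANY of its words starts with the prefix.
    static List<String> tokenizedPrefix(List<String> names, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String name : names) {
            for (String token : name.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (token.startsWith(prefix)) {
                    hits.add(name);
                    break; // one matching word is enough
                }
            }
        }
        return hits;
    }

    // Literal-string behaviour: a name matches only if the WHOLE value starts with the prefix.
    static List<String> wholeValuePrefix(List<String> names, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String name : names) {
            if (name.toLowerCase(Locale.ROOT).startsWith(prefix)) {
                hits.add(name);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> names = List.of("foo bar", "blabla foo", "foo2 bar", "test data");
        System.out.println(tokenizedPrefix(names, "f"));  // [foo bar, blabla foo, foo2 bar]
        System.out.println(wholeValuePrefix(names, "f")); // [foo bar, foo2 bar]
    }
}
```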
Upvotes: 0