Reputation: 7961
I have location information provided by GeoNames.org parsed into a relational database. Using this information, I am attempting to build an Elasticsearch index that contains populated place (city) names, administrative division (state, province, etc.) names, country names, and country codes. My goal is to provide a location search similar to Google Maps' autocomplete.
I don't need the cool bold highlighting, but I do need the search to return similar results in a similar way. I've tried creating a mapping with a single location field consisting of the entire location name (e.g., "Round Rock, TX, United States"), and I've also tried five separate fields, one for each piece of the location. I've tried keyword and prefix queries and edge n-gram analyzers, but I haven't found a configuration that gets this working properly.
What kinds of analyzers, at both index time and search time, should I be looking at to accomplish this? The search doesn't have to be as polished as Google's, but I'd like it to be at least similar.
I do want to support partial-name matches, which is why I've been fiddling with edge n-grams. For example, a search of "round r" should match Round Rock, TX, United States. Also, I would prefer that results whose populated place (city) names begin with the exact search term rank higher than other results. For example, a search of "round ro" should match Round Rock, TX, United States before Round, Some Province, RO (Romania). I hope I've made this clear enough.
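To illustrate, here is roughly the query shape I am aiming for (a sketch only; the field names match my mapping below, and the boost value is arbitrary):

var query = new
{
    query = new
    {
        @bool = new  // "@" escapes the C# keyword; serializes as "bool"
        {
            should = new object[]
            {
                // Boost city-name matches so "round ro" ranks Round Rock, TX
                // above places that only match on province/country fields.
                new { match = new { populatedPlace = new { query = "round ro", boost = 3.0 } } },
                new { match = new { administrativeDivision = "round ro" } },
                new { match = new { administrativeDivisionAbbreviation = "round ro" } },
                new { match = new { country = "round ro" } }
            }
        }
    }
};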
Here is my current index configuration (this is an anonymous type in C# that is later serialized to JSON and passed to the Elasticsearch API):
var request = new
{
    settings = new
    {
        index = new
        {
            number_of_shards = 1,
            number_of_replicas = 0,
            refresh_interval = -1,  // automatic refresh disabled (e.g., during bulk loading)
            analysis = new
            {
                analyzer = new
                {
                    edgengram_index_analyzer = new
                    {
                        type = "custom",
                        tokenizer = "index_tokenizer",
                        filter = new[] { "lowercase", "asciifolding" },
                        char_filter = new[] { "no_commas_char_filter" },
                        stopwords = new object[0]
                    },
                    search_analyzer = new
                    {
                        type = "custom",
                        tokenizer = "standard",
                        filter = new[] { "lowercase", "asciifolding" },
                        char_filter = new[] { "no_commas_char_filter" },
                        stopwords = new object[0]
                    }
                },
                tokenizer = new
                {
                    index_tokenizer = new
                    {
                        type = "edgeNGram",
                        min_gram = 1,
                        max_gram = 100
                    }
                },
                char_filter = new
                {
                    no_commas_char_filter = new
                    {
                        type = "mapping",
                        mappings = new[] { ",=>" }  // strip commas from "Round Rock, TX, ..."
                    }
                }
            }
        }
    },
    mappings = new
    {
        location = new
        {
            _all = new { enabled = false },
            properties = new
            {
                populatedPlace = new { index_analyzer = "edgengram_index_analyzer", type = "string" },
                administrativeDivision = new { index_analyzer = "edgengram_index_analyzer", type = "string" },
                administrativeDivisionAbbreviation = new { index_analyzer = "edgengram_index_analyzer", type = "string" },
                country = new { index_analyzer = "edgengram_index_analyzer", type = "string" },
                countryCode = new { index_analyzer = "edgengram_index_analyzer", type = "string" },
                population = new { type = "long" }
            }
        }
    }
};
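For reference, a document indexed under this mapping looks like the following (the values here are illustrative):

var roundRock = new
{
    populatedPlace = "Round Rock",
    administrativeDivision = "Texas",
    administrativeDivisionAbbreviation = "TX",
    country = "United States",
    countryCode = "US",
    population = 99887L  // illustrative figure
};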
Upvotes: 3
Views: 3528
Reputation: 1323
This might be what you are looking for:
"analysis": {
"tokenizer": {
"name_tokenizer": {
"type": "edgeNGram",
"max_gram": 100,
"min_gram": 2,
"side": "front"
}
},
"analyzer": {
"name_analyzer": {
"tokenizer": "whitespace",
"type": "custom",
"filter": ["lowercase", "multi_words", "name_filter"]
},
},
"filter": {
"multi_words": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 10
},
"name_filter": {
"type": "edgeNGram",
"max_gram": 100,
"min_gram": 2,
"side": "front"
},
}
}
I think using name_analyzer will replicate the Google-style search you are talking about. You can tweak the configuration a bit to suit your needs.
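To see why this works, run the analyzer against a sample string. Here is a quick sketch using the _analyze endpoint as it existed in the Elasticsearch versions this configuration targets (the index name "locations" and the host are placeholders; adjust to your setup):

// Sketch: inspect what name_analyzer emits for a sample location string.
using System;
using System.Net.Http;

class AnalyzeDemo
{
    static void Main()
    {
        using (var client = new HttpClient())
        {
            var url = "http://localhost:9200/locations/_analyze" +
                      "?analyzer=name_analyzer&text=Round%20Rock%20TX";
            var json = client.GetStringAsync(url).Result;
            Console.WriteLine(json);
            // The shingle filter emits multi-word tokens ("round rock",
            // "round rock tx"), and the edgeNGram filter then indexes every
            // 2+ character prefix of each, so partial input like "round ro"
            // matches a stored token directly.
        }
    }
}

Because the n-gramming happens at index time, at query time you would typically search these fields with a plain whitespace/lowercase analyzer so the user's partial input matches the stored prefixes without being n-grammed again.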
Upvotes: 2