Reputation: 1970
I am trying to run a faceted query on some logs that I have stored in Elasticsearch (ES). The logs look something like this:
{"severity": "informational","message_hash_value": "00016B15", "user-agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1", "host": "192.168.8.225", "version": "1.0", "user": "[email protected]", "created_timestamp": "2013-03-01T15:34:00", "message": "User viewed contents", "inserted_timestamp": "2013-03-01T15:34:00"}
The query that I am trying to run is
curl -XGET 'http://127.0.0.1:9200/logs-*/logs/_search' -d '{
"from" : 0, "size" : 0,
"facets" : {
"user" : {
"terms" : {"field" : "user", "size" : 999999 } } } }'
Notice that the "user" field in the logs is an email address. The problem is that the terms facet query I use returns a list of terms from the "user" field, as shown below.
u'facets': {u'user': {u'_type': u'terms', u'total': 2004, u'terms': [{u'count': 1002,u'term': u'test.co'}, {u'count': 320, u'term': u'user_1'}, {u'count': 295,u'term': u'user_2'}
Note that the list contains the term
{u'count': 1002,u'term': u'test.co'}
which is the domain name of the users' email addresses. Why is Elasticsearch treating the domain as a separate term?
Running a query to check the mappings
curl -XGET 'http://127.0.0.1:9200/logs-*/_mapping?pretty=true'
yields the following for the "user" field:
"user" : {
"type" : "string"
},
Upvotes: 1
Views: 741
Reputation: 743
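This happens because Elasticsearch's default analyzer splits the field value into tokens at index time, breaking on "@" as well as on whitespace and most punctuation. The terms facet counts those indexed tokens, so the domain shows up as a term of its own. You can get around this by telling Elasticsearch not to run an analyzer on this field, but you will have to reindex all of your data into an index that uses the new mapping.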
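To make the effect concrete, here is a minimal Python sketch of how a terms facet would count an analyzed field versus a not_analyzed one. The sample addresses are assumptions reconstructed from the facet terms in the question, and `analyzed_tokens` is a hypothetical, simplified stand-in for the analyzer, not Elasticsearch's actual tokenizer.

```python
import re
from collections import Counter

# Hypothetical sample values (assumed from the facet terms in the question).
users = ["user_1@test.co", "user_2@test.co", "user_1@test.co"]

def analyzed_tokens(value):
    # Simplified stand-in for an analyzer: lowercase, then split on "@" and whitespace.
    # This only illustrates the idea; it is not Elasticsearch's real tokenizer.
    return [t for t in re.split(r"[@\s]+", value.lower()) if t]

def terms_facet(values, analyzed):
    # Count, per term, the number of documents that contain it.
    counts = Counter()
    for value in values:
        terms = set(analyzed_tokens(value)) if analyzed else {value}
        counts.update(terms)
    return counts

analyzed = terms_facet(users, analyzed=True)
not_analyzed = terms_facet(users, analyzed=False)

print(dict(analyzed))      # the domain is counted as its own term
print(dict(not_analyzed))  # each full address is a single term
```

With the analyzed variant the shared domain collects a count from every document, which matches the pattern in the question where the domain has the highest count.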
Create your new index
curl -XPUT 'http://localhost:9200/logs-new'
Specify in this new index's mapping that you don't want to analyze the "user" field
curl -XPUT 'http://localhost:9200/logs-new/logs/_mapping' -d '{
"logs" : {
"properties" : {
"user" : {
"type" : "string",
"index" : "not_analyzed"
}
}
}
}'
Index a document
curl -XPOST 'http://localhost:9200/logs-new/logs' -d '{
"created_timestamp": "2013-03-01T15:34:00",
"host": "192.168.8.225",
"inserted_timestamp": "2013-03-01T15:34:00",
"message": "User viewed contents",
"message_hash_value": "00016B15",
"severity": "informational",
"user": "[email protected]",
"user-agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1",
"version": "1.0"
}'
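Since every existing document has to be indexed again, posting them one at a time can be slow. A rough sketch, assuming you have already read the old documents into a list of dicts: the helper below (the name `bulk_index_body` and the sample documents are hypothetical) builds a newline-delimited body in the format the `_bulk` endpoint expects, with one action line followed by one source line per document.

```python
import json

def bulk_index_body(docs, index="logs-new", doc_type="logs"):
    """Build a newline-delimited request body for the _bulk endpoint."""
    lines = []
    for doc in docs:
        # Action line: index this document into the given index and type.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(doc))
    # The bulk body must end with a trailing newline.
    return "\n".join(lines) + "\n"

# Hypothetical documents standing in for the data read from the old index.
docs = [
    {"user": "user_1@test.co", "message": "User viewed contents"},
    {"user": "user_2@test.co", "message": "User viewed contents"},
]

body = bulk_index_body(docs)
print(body)
```

The resulting text could be saved to a file and sent with something like curl -XPOST 'http://localhost:9200/_bulk' --data-binary @bulk.json, where bulk.json is an assumed file name.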
The terms facet will now return the entire email address as a single term
curl -XGET 'http://localhost:9200/logs-new/logs/_search?pretty' -d '{
"from":0,
"size":0,
"facets" : {
"user" : {
"terms" : {
"field" : "user",
"size" : 999999
}
}
}
}'
Result:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ ]
},
"facets" : {
"user" : {
"_type" : "terms",
"missing" : 0,
"total" : 1,
"other" : 0,
"terms" : [ {
"term" : "[email protected]",
"count" : 1
} ]
}
}
}
References:
Core Types: http://www.elasticsearch.org/guide/reference/mapping/core-types/
Reindexing with a new mapping: https://groups.google.com/forum/?fromgroups#!topic/elasticsearch/tCaXgjfUFVU
Upvotes: 2