Reputation: 110
This issue is probably due to my noobishness to ELK, Python, and Unicode.
I have an index containing logstash-digested logs, including a field 'req_host', which contains a host name. Using elasticsearch-py, I'm pulling that host name out of the record and using it to search in another index. However, if the host name contains multibyte characters, the search fails with a UnicodeDecodeError. Exactly the same query works fine when I enter it from the command line with 'curl -XGET'. The Unicode character is a lowercase 'a' with a diaeresis (two dots). Its UTF-8 encoding is C3 A4, and its Unicode code point is U+00E4 (the language is Swedish).
These curl commands work just fine from the command line:
curl -XGET 'http://localhost:9200/logstash-2015.01.30/logs/_search?pretty=1' -d ' { "query" : {"match" :{"req_host" : "www.utkl\u00E4dningskl\u00E4derna.se" }}}'
curl -XGET 'http://localhost:9200/logstash-2015.01.30/logs/_search?pretty=1' -d ' { "query" : {"match" :{"req_host" : "www.utklädningskläderna.se" }}}'
Both find and return the record.
(The second command shows the host name as it appears in the log I pull it from, with the lowercase 'a' with a diaeresis in two places.)
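For reference, here is a minimal sketch of how the code point, the UTF-8 bytes, and the two curl bodies relate. It is not part of the original script: the variable names are made up for illustration, and it assumes a Python 3 interpreter (the u'' prefixes are kept only so the literals read like the Python 2 ones in this post).

```python
import json

a_umlaut = u"\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS

# The code point is a number; the UTF-8 encoding is a two-byte sequence.
code_point = ord(a_umlaut)              # 0xE4
utf8_bytes = a_umlaut.encode("utf-8")   # the bytes C3 A4

# The two curl bodies: one uses a JSON \u escape, the other the raw character.
# The raw-string prefix keeps the backslash so that JSON sees the escape.
escaped_body = r'{"query": {"match": {"req_host": "www.utkl\u00E4dningskl\u00E4derna.se"}}}'
raw_body = u'{"query": {"match": {"req_host": "www.utkl\u00e4dningskl\u00e4derna.se"}}}'

# After JSON parsing, both bodies describe exactly the same query.
same_query = json.loads(escaped_body) == json.loads(raw_body)
print(hex(code_point), same_query)
```

This is presumably why both curl commands return the same record: a JSON parser sees the same host name either way.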
I've written a very short Python script to show the problem: It uses hardwired queries, printing them and their type, then trying to use them in a search.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import json
import elasticsearch
es = elasticsearch.Elasticsearch()
if __name__=="__main__":
    #uq = u'{ "query": { "match": { "req_host": "www.utklädningskläderna.se" }}}' # raw utf-8 characters. does not work
    #uq = u'{ "query": { "match": { "req_host": "www.utkl\u00E4dningskl\u00E4derna.se" }}}' # quoted unicode characters. does not work
    #uq = u'{ "query": { "match": { "req_host": "www.utkl\uC3A4dningskl\uC3A4derna.se" }}}' # quoted utf-8 characters. does not work
    uq = u'{ "query": { "match": { "req_host": "www.facebook.com" }}}' # non-unicode. works fine
    print "uq", type(uq), uq
    result = es.search(index="logstash-2015.01.30",doc_type="logs",timeout=1000,body=uq);
    if result["hits"]["total"] == 0:
        print "nothing found"
    else:
        print "found some"
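A side note on the three commented-out literals: in a Python unicode literal, `\uXXXX` takes a code point, not UTF-8 bytes. The sketch below is illustrative only (the variable names are hypothetical, and it assumes Python 3). It checks that the first two literals hold the same text, while `\uC3A4` names a different character altogether.

```python
import unicodedata

# Host name as written in the first two literals (raw character or \u00E4).
intended = u"www.utkl\u00e4dningskl\u00e4derna.se"
# Host name as written in the third literal, using the UTF-8 bytes as a code point.
mistaken = u"www.utkl\uC3A4dningskl\uC3A4derna.se"

# "www.utkl" is 8 characters long, so index 8 is the first non-ASCII character.
intended_name = unicodedata.name(intended[8])
mistaken_name = unicodedata.name(mistaken[8])

print(intended_name)
print(mistaken_name)
print(intended == mistaken)
```

So the third variant would not match the host name even if the request went through. It searches for a host containing a Hangul syllable.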
If I run it as shown, with the 'facebook' query, it's fine - the output is:
$python testutf8b.py
uq <type 'unicode'> { "query": { "match": { "req_host": "www.facebook.com" }}}
found some
Note that the query string 'uq' is unicode.
But if I use the other three strings, which include the Unicode characters, it blows up. For example, with the second line, I get:
$python testutf8b.py
uq <type 'unicode'> { "query": { "match": { "req_host": "www.utklädningskläderna.se" }}}
Traceback (most recent call last):
  File "testutf8b.py", line 15, in <module>
    result = es.search(index="logstash-2015.01.30",doc_type="logs",timeout=1000,body=uq);
  File "build/bdist.linux-x86_64/egg/elasticsearch/client/utils.py", line 68, in _wrapped
  File "build/bdist.linux-x86_64/egg/elasticsearch/client/__init__.py", line 497, in search
  File "build/bdist.linux-x86_64/egg/elasticsearch/transport.py", line 307, in perform_request
  File "build/bdist.linux-x86_64/egg/elasticsearch/connection/http_urllib3.py", line 82, in perform_request
elasticsearch.exceptions.ConnectionError: ConnectionError('ascii' codec can't decode byte 0xc3 in position 45: ordinal not in range(128)) caused by: UnicodeDecodeError('ascii' codec can't decode byte 0xc3 in position 45: ordinal not in range(128))
$
Again, note that the query string is a unicode string (yes, the source code line is the one with the \u00E4 characters).
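The error message itself can be reproduced without Elasticsearch. This is only a sketch under an assumption: that somewhere in the request path the query gets encoded to UTF-8 and those bytes are then decoded as ASCII. It does not show where in elasticsearch-py or urllib3 that happens. The names are illustrative, and it assumes Python 3.

```python
# The failing query string, with the same spacing as in the script.
query = u'{ "query": { "match": { "req_host": "www.utkl\u00e4dningskl\u00e4derna.se" }}}'

# Encode to UTF-8: each a-with-diaeresis becomes the two bytes C3 A4.
utf8_body = query.encode("utf-8")

# Decoding those bytes as ASCII fails on the first non-ASCII byte.
try:
    utf8_body.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)
```

The failing byte is 0xc3 at position 45, which is where the first 'ä' sits in the query string, so the traceback is consistent with UTF-8 bytes being treated as ASCII.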
I'd really like to resolve this. I've tried various combinations of uq = uq.encode("utf-8") and uq = uq.decode("utf-8"), but it doesn't seem to help. I'm starting to wonder if there's an issue in the elasticsearch-py library.
thanks!
pt
PS: This is under CentOS 7, using ES 1.5.0. The logs were digested into ES under a slightly older version, using logstash-1.4.2.
Upvotes: 0
Views: 4134
Reputation: 8572
Basically, you don't need to pass body as a string. Use native Python data structures, or serialize them on the fly with json.dumps. Give it a try:
>>> import elasticsearch
>>> es = elasticsearch.Elasticsearch()
>>> es.index(index='unicode-index', body={'host': u'www.utklädningskläderna.se'}, doc_type='log')
{u'_id': u'AUyGJuFMy0qdfghJ6KwJ',
u'_index': u'unicode-index',
u'_type': u'log',
u'_version': 1,
u'created': True}
>>> es.search(index='unicode-index', body={}, doc_type='log')
{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},
u'hits': {u'hits': [{u'_id': u'AUyBTz5CsiBSSvubLioQ',
u'_index': u'unicode-index',
u'_score': 1.0,
u'_source': {u'host': u'www.utkl\xe4dningskl\xe4derna.se'},
u'_type': u'log'}],
u'max_score': 1.0,
u'total': 1},
u'timed_out': False,
u'took': 5}
>>> es.search(index='unicode-index', body={'query': {'match': {'host': u'www.utklädningskläderna.se'}}}, doc_type='log')
{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},
u'hits': {u'hits': [{u'_id': u'AUyBTz5CsiBSSvubLioQ',
u'_index': u'unicode-index',
u'_score': 0.30685282,
u'_source': {u'host': u'www.utkl\xe4dningskl\xe4derna.se'},
u'_type': u'log'}],
u'max_score': 0.30685282,
u'total': 1},
u'timed_out': False,
u'took': 122}
>>> import json
>>> body={'query': {'match': {'host': u'www.utklädningskläderna.se'}}}
>>> es.search(index='unicode-index', body=body, doc_type='log')
{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},
u'hits': {u'hits': [{u'_id': u'AUyBTz5CsiBSSvubLioQ',
u'_index': u'unicode-index',
u'_score': 0.30685282,
u'_source': {u'host': u'www.utkl\xe4dningskl\xe4derna.se'},
u'_type': u'log'}],
u'max_score': 0.30685282,
u'total': 1},
u'timed_out': False,
u'took': 4}
>>> es.search(index='unicode-index', body=json.dumps(body), doc_type='log')
{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},
u'hits': {u'hits': [{u'_id': u'AUyBTz5CsiBSSvubLioQ',
u'_index': u'unicode-index',
u'_score': 0.30685282,
u'_source': {u'host': u'www.utkl\xe4dningskl\xe4derna.se'},
u'_type': u'log'}],
u'max_score': 0.30685282,
u'total': 1},
u'timed_out': False,
u'took': 5}
>>> json.dumps(body)
'{"query": {"match": {"host": "www.utkl\\u00e4dningskl\\u00e4derna.se"}}}'
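The last line of the session hints at why the `json.dumps` route is safe: by default it escapes non-ASCII characters, so the serialized body contains only ASCII. Here is a minimal sketch of that, using the question's field name. `build_host_query` is a hypothetical helper, not part of elasticsearch-py, and the snippet assumes Python 3.

```python
import json


def build_host_query(host, field="req_host"):
    """Return a match query as a plain dict (hypothetical helper)."""
    return {"query": {"match": {field: host}}}


body = build_host_query(u"www.utkl\u00e4dningskl\u00e4derna.se")

# With the default ensure_ascii=True, non-ASCII characters become \uXXXX escapes.
payload = json.dumps(body)

# The serialized text is pure ASCII, so this encode cannot fail.
ascii_bytes = payload.encode("ascii")

# Parsing the payload gives back the original host name.
round_trip = json.loads(payload)

print(payload)

# Either form can then be passed on, as in the session above:
#   es.search(index="logstash-2015.01.30", doc_type="logs", body=body)
#   es.search(index="logstash-2015.01.30", doc_type="logs", body=payload)
```

Passing the dict directly is the simpler choice, since the client then handles serialization itself.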
Upvotes: 2