Reputation: 1079
I'm using elasticsearch in a python web app in order to query news documents. There're actually 100000 documents in the database.
The original db is a mongo one and elasticsearch is plugged through the mongoriver plugin.
The problem is that the function takes ~850ms to return the results. I'd like to decrease that number as much as possible.
Here's the python code I'm using to query the db(the limit is usually 16):
def search_news(term, limit, page, flagged_articles):
query = {
"query": {
"from": page*limit,
"size": limit,
"multi_match" : {
"query" : term,
"fields" : [ "title^3" , "category^5" , "entities.name^5", "art_text^1", "summary^1"]
}
},
"filter" : {
"not" : {
"filter" : {
"ids" : {
"values" : flagged_articles
}
},
"_cache" : True
}
}
}
es_query = json_util.dumps(query)
uri = 'http://localhost:9200/newsidx/_search'
r = requests.get(uri, data=es_query)
results = json.loads( r.text )
data = []
for res in results['hits']['hits']:
data.append(res['_source'])
return data
And here's the index mapping:
{
"news": {
"properties": {
"actual_rank": {
"type": "long"
},
"added": {
"type": "date",
"format": "dateOptionalTime"
},
"api_id": {
"type": "long"
},
"art_text": {
"type": "string"
},
"category": {
"type": "string"
},
"downvotes": {
"type": "long"
},
"entities": {
"properties": {
"etype": {
"type": "string"
},
"name": {
"type": "string"
}
}
},
"flags": {
"properties": {
"a": {
"type": "long"
},
"b": {
"type": "long"
},
"bad_image": {
"type": "long"
},
"c": {
"type": "long"
},
"d": {
"type": "long"
},
"innapropiate": {
"type": "long"
},
"irrelevant_info": {
"type": "long"
},
"miscategorized": {
"type": "long"
}
}
},
"media": {
"type": "string"
},
"published": {
"type": "string"
},
"published_date": {
"type": "date",
"format": "dateOptionalTime"
},
"show": {
"type": "boolean"
},
"source": {
"type": "string"
},
"source_rank": {
"type": "double"
},
"summary": {
"type": "string"
},
"times_showed": {
"type": "long"
},
"title": {
"type": "string"
},
"top_entities": {
"properties": {
"einfo_test": {
"type": "string"
},
"etype": {
"type": "string"
},
"name": {
"type": "string"
}
}
},
"tweet_article_poster": {
"type": "string"
},
"tweet_favourites": {
"type": "long"
},
"tweet_retweets": {
"type": "long"
},
"tweet_user_rank": {
"type": "double"
},
"upvotes": {
"type": "long"
},
"url": {
"type": "string"
}
}
}
}
Edit: The response time was measured on the server, given the tornado server information output.
Upvotes: 0
Views: 690
Reputation: 752
First of all you can add the boosts to your mapping (assuming it doesn't interfere with your other queries) like this:
"title": {
"boost": 3.0,
"type": "string"
},
"category": {
"boost": 5.0,
"type": "string"
},
etc.
Then setup a bool query with field (or term) queries like this:
"query": {
"bool" : {
"should" : [ {
"field" : {
"title" : term
}
}, {
"field" : {
"category" : term
}
} ],
"must_not" : {
"ids" : {"values" : flagged_articles}
}
}
}
"from": page * limit,
"size": limit
This should perform better, but without access to your setup I can't test it :)
Upvotes: 1
Reputation: 3094
I've rewritten your query somewhat here, moving the size
and limit
to the outside scope, adding the filtered
query clause and changing your not
query to a bool
/must_not
query, which should be cached by default:
{
"query": {
"filtered": {
"query": {
"multi_match" : {
"query" : term,
"fields" : [ "title^3" , "category^5" , "entities.name^5", "art_text^1", "summary^1"]
}
},
"filter" : {
"bool" : {
"must_not" : {
"ids" : {"values" : flagged_articles}
}
}
}
}
}
"from": page * limit,
"size": limit,
}
I haven't tested this, and I haven't made sense of your mapping as it is jumbled, so there might be some improvements to be made there.
Edit: This is a great read on why to use the bool
filter: http://www.elasticsearch.org/blog/all-about-elasticsearch-filter-bitsets/ - in short, bool
uses 'bitsets', which are very fast on subsequent queries.
Upvotes: 2