TIMEX

Reputation: 271634

How do I architect my Elasticsearch indexes for rapidly changing data?

I have two models in my MySQL database: Users and Posts

Users have geolocation attributes (lat/long).

Posts simply have a body of text.

I want to use Elasticsearch to find all posts that match a string of text, using the user's location as a filter. The problem is that a user's location is always changing (as people walk around the city), so I will be frequently updating the lat/long of each user.

This is my current solution:

Index the posts with a geolocation attribute in each document. When a user changes location, run an Elasticsearch batch update on all of that user's posts and modify the geolocation attribute on those documents.
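Roughly, for each location change I run something like this with the _update_by_query API (a sketch in current Elasticsearch syntax; the user_id value and the location field name are placeholders from my schema):

# user 123 and the "location" field name are placeholders
POST /posts/_update_by_query
{
  "query": {
    "term": { "user_id": 123 }
  },
  "script": {
    "source": "ctx._source.location = params.newLocation",
    "params": {
      "newLocation": { "lat": 32.5362723, "lon": -80.3654783 }
    }
  }
}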

Obviously this is not a scalable solution: what if a user has 2000 posts and walks around the city? I'd have to update 2000 documents every minute.

Is there a way to "relationally map" the posts to the user object and use it as a filter, so that when the location changes I only need to update that user's object instead of all of their posts?

Upvotes: 3

Views: 437

Answers (1)

Val

Reputation: 217274

Updating 2000 posts per minute is not a big deal, either with the update-by-query plugin or with the upcoming reindex API. However, if you have many users with many posts and you need to update them at short intervals (e.g. every minute), it might indeed not scale. Say it takes 500 milliseconds to update all of a user's posts: at 0.5 seconds per user and a 60-second cycle, you'd start to lag behind at around 60 / 0.5 = 120 users.

Clearly, since the users' posts need to "follow" the user rather than keep the location the user had when they were posted, I would first query the users around a given location to get their IDs, and then run a second query on posts filtered by those user IDs and matching the body text.

It is perfectly OK to keep both of your indices simple and only update the location in a single user document every minute. The two queries I'm suggesting should be quite fast, and you should not be worried about running them. People are often concerned when they need to run two or more queries to find their results, but trying to tie documents too tightly together is sometimes not the solution; simply running two queries over two indices is the key here, and it works perfectly well.
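Updating a single user's location is then just one partial document update per interval, something like this (a sketch in current Elasticsearch syntax; the user id is a placeholder):

# user id 1 is a placeholder
POST /users/_update/1
{
  "doc": {
    "location": {
      "lat": 32.5362723,
      "lon": -80.3654783
    }
  }
}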

The query to retrieve users would look like the one below, where you only retrieve each user's _id property. I'm assuming that your user documents use the user id as their ES doc _id, so you don't need to retrieve the _source at all (i.e. "_source": false), which is even faster, and you can simply return the _id with response filtering:

POST /users/_search?filter_path=hits.hits._id
{
  "size": 1000,
  "_source": false,
  "query": {
    "bool": {
      "filter": [
        {
          "geo_distance": {
            "distance": "100m",
            "location": {
              "lat": 32.5362723,
              "lon": -80.3654783
            }
          }
        }
      ]
    }
  }
}
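Thanks to the filter_path response filtering, the response is trimmed down to just the ids, roughly like this (assuming three matching users, with ids 1, 2 and 3):

{
  "hits": {
    "hits": [
      { "_id": "1" },
      { "_id": "2" },
      { "_id": "3" }
    ]
  }
}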

These are the ids of all users who are currently within 100 meters of the desired geographic location. The next query then consists of filtering the posts by those ids while matching the body text.

POST /posts/_search
{
  "size": 50,
  "query": {
    "bool": {
      "must": {
        "match": {
          "body": "some text"
        }
      },
      "filter": [
        {
          "terms": {
            "user_id": [ 1, 2, 3, 4 ]
          }
        }
      ]
    }
  }
}
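For completeness, both queries assume mappings along these lines: a geo_point field named location on users, and a numeric (or keyword) user_id field on posts. This is a sketch in current Elasticsearch syntax, with field names taken from the queries above; adjust types to your data:

# field names follow the queries above; types are assumptions
PUT /users
{
  "mappings": {
    "properties": {
      "location": { "type": "geo_point" }
    }
  }
}

PUT /posts
{
  "mappings": {
    "properties": {
      "body":    { "type": "text" },
      "user_id": { "type": "long" }
    }
  }
}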

Upvotes: 6
