emson

Reputation: 10505

How to efficiently search large datasets by location and date range?

I have a MongoDB collection containing attributes such as:

longitude, latitude, start_date, end_date, price

I have over 500 million documents.

My question is: how do I search by lat/long, date range and price as efficiently as possible?
As I see it, my options are:

  1. Create a geospatial index on lat/long and use MongoDB's proximity search, then filter the results by date range and price (a rough sketch of this follows after the list).
    • I have yet to test this, but I worry that the amount of data would be too much to search quickly when we have around 1 search a second.
    • Have you had experience with how MongoDB behaves under these circumstances?
  2. Split the data into multiple collections by location, i.e. by city: london_collection, paris_collection, new_york_collection.
    • I would then have to query by lat/long first to find the nearest city collection, and then do a MongoDB spatial search on the subset of data in that collection with date and price filters.
    • The distribution of documents would be uneven, as some cities would have many more documents than others.
  3. Create collections by date instead of location. Same as above, but each document is allocated to a collection based on its date range.
    • The problem here is searches whose date range straddles multiple collections.
  4. Create unique ids based on city_start_date_end_date for each document.
    • Again I would have to use my lat/long query to find the nearest city, then append the date range to build the key. This seems to be pretty fast, but I don't really like the city lookup aspect... it seems a bit ugly.
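
For reference, this is roughly what I have in mind for option 1 — a minimal sketch using pymongo, assuming I combine latitude/longitude into a single GeoJSON "location" field (a 2dsphere index needs the coordinate pair in one indexed field; the database/collection names and values below are just placeholders):

    # Minimal sketch of option 1: 2dsphere index plus date/price filters.
    # Assumes coordinates are stored as a GeoJSON point in "location".
    from datetime import datetime
    from pymongo import MongoClient, GEOSPHERE

    client = MongoClient()
    coll = client.mydb.listings  # placeholder database/collection names

    # Geospatial index on the point field.
    coll.create_index([("location", GEOSPHERE)])

    # Documents within ~5 km of a point whose date range overlaps the
    # requested week and whose price is below a ceiling.
    query = {
        "location": {
            "$near": {
                "$geometry": {"type": "Point", "coordinates": [-0.1276, 51.5072]},
                "$maxDistance": 5000,  # metres
            }
        },
        "start_date": {"$lte": datetime(2013, 7, 14)},
        "end_date": {"$gte": datetime(2013, 7, 7)},
        "price": {"$lte": 200},
    }
    results = list(coll.find(query).limit(50))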

I am in the process of experimenting with option 1, but would really like to hear your ideas before I go too far down one particular path.

How do search engines split up and manage their data? This must be a similar kind of problem.

Also, I don't have to use MongoDB; I'm open to other options.

Many thanks.

Upvotes: 2

Views: 1857

Answers (2)

mnemosyn

Reputation: 46301

Why do you think option 1 would be too slow? Is this the result of a real-world test, or merely an assumption that it might eventually not work out?

MongoDB has native support for geohashing and turns coordinates into a single number which can then be searched by a BTree traversal. This should be reasonably fast. Messing around with multiple collections does not seem like a very good idea to me. All it does is replace one level of BTree traversal on the database with some code you still need to write, test and maintain.

Don't reinvent the wheel, but try to optimize the most obvious path (1) first:

  1. Set up geo indexes
  2. Use explain to make sure your queries actually use the index (see the sketch after this list)
  3. Make sure your indexes fit into RAM
  4. Profile the database using the built-in profiler
  5. Don't measure performance on a 'cold' system where the indexes didn't have a chance to go to RAM yet
  6. If possible, avoid geoNear and stick to the faster (but not perfectly spherical) near queries
  7. If you're still hitting limits, look at sharding to distribute reads and writes to multiple machines.
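
A rough sketch of checks 2–4 from the list above, using pymongo (the database, collection and field names are assumptions, so adjust them to your schema):

    from pymongo import MongoClient

    client = MongoClient()
    db = client.mydb                # placeholder database name
    coll = db.listings              # placeholder collection name

    # 2. Confirm a representative query actually uses the geo index.
    plan = coll.find({
        "location": {"$near": {
            "$geometry": {"type": "Point", "coordinates": [-0.1276, 51.5072]},
            "$maxDistance": 5000,
        }}
    }).explain()
    print(plan)

    # 3. Compare total index size against available RAM.
    stats = db.command("collStats", "listings")
    print(stats["totalIndexSize"])  # in bytes; should fit comfortably in memory

    # 4. Enable the built-in profiler for operations slower than 100 ms.
    db.command({"profile": 1, "slowms": 100})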

Upvotes: 1

Sam Greenhalgh

Reputation: 6136

Indexing and data access performance is a deep and complex subject. A lot of factors can affect the most efficient solution, including the size of your data sets, the read-to-write ratio, the relative performance of your IO and backing store, etc.

While I can't give you a concrete answer, I can suggest investigating Morton numbers as an efficient way of combining multiple related numeric values, like lat/longs, into a single indexable key.

Morton number
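
For illustration, here's a minimal, database-agnostic sketch of computing a Morton (Z-order) key in Python by quantising lat/long into integers and interleaving their bits (the 16-bit precision and the scaling are arbitrary choices):

    def part1by1(n):
        """Spread the lower 16 bits of n so there is a zero bit between each."""
        n &= 0x0000FFFF
        n = (n | (n << 8)) & 0x00FF00FF
        n = (n | (n << 4)) & 0x0F0F0F0F
        n = (n | (n << 2)) & 0x33333333
        n = (n | (n << 1)) & 0x55555555
        return n

    def morton_key(lat, lon, bits=16):
        """Quantise lat/long to `bits`-bit integers and interleave the bits."""
        scale = (1 << bits) - 1
        x = int((lon + 180.0) / 360.0 * scale)  # longitude -> [0, 2^bits)
        y = int((lat + 90.0) / 180.0 * scale)   # latitude  -> [0, 2^bits)
        return (part1by1(y) << 1) | part1by1(x)

    # Points that are close in space tend to get numerically close keys, so the
    # key can be stored as a single indexed field and range-scanned.
    print(morton_key(51.5072, -0.1276))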

Upvotes: 2
