Braydon Batungbacal

Reputation: 1038

How would I use DynamoDB to move this usage from my mysql db to nosql?

I'm currently experiencing issues with a service I've developed that relies heavily on large-payload reads from the db (up to 500 rows per request). I'm seeing huge throughput, in the range of 35,000+ requests per minute, going through the DB, and it is not handling the scaling at all.

The data in question is retrieved primarily by a latitude / longitude WHERE clause that checks whether the latitude and longitude of the row fall between a minimum latitude/longitude coordinate and a maximum latitude/longitude coordinate. This is effectively checking whether the row in question is within the bounding box created by the min / max passed into the WHERE.

For reference, this is the WHERE portion of the query we rely on:

s.latitude > {minimumLatitude} AND 
s.longitude > {minimumLongitude} AND 
s.latitude < {maximumLatitude} AND 
s.longitude < {maximumLongitude}

So, with that said: MySQL is handling this fine, but I'm presently on RDS and having to rely heavily on an r3.8XL master and three r3.8XL read replicas just to get the throughput capacity I need to keep the application from slowing down and pinning the CPU at 100%.

Obviously, with how heavy the payload is and how frequently it's queried, this data needs to be moved into a more fitting service, something like ElastiCache or DynamoDB.

I've been leaning towards DynamoDB, but my only option there seems to be using Scan, as there is no useful primary key I can associate with my data to reduce the result set, since the query relies on calculating whether the latitude / longitude of a point is within a bounding box. DynamoDB's filters on attributes would work great, as they support the basic conditions needed; however, on a table that would be 250,000+ rows and growing by nearly 200,000 a day or more, that would be unusably expensive.

Another option to reduce the result set was to use a map binning technique to associate a map region with the data, use that region as the primary key in DynamoDB, and then filter further on the latitude / longitude attributes. This wouldn't be ideal, though: we'd prefer to get data within the specific bounds requested and not have excess redundant data passed back, since the min/max lat/lng can overlap multiple bins and would then pull data from bins where the majority of it isn't needed.

At this point I'm continuously having to deploy read replicas to keep the service up and it's definitely not ideal. Any help would be greatly appreciated.

Upvotes: 1

Views: 173

Answers (1)

Michael - sqlbot

Reputation: 178956

You seem to be overlooking the obvious first thing to try... indexing the data using an index structure suited to the nature of the data... in MySQL.

B-trees are of limited help since you still have to examine all possible matches in one dimension after eliminating impossible matches in the other.

Aside: Assuming you already have an index on (lat,long), you will probably be able to gain some short-term performance improvement by adding a second index with the columns reversed (long,lat). Try this on one of your replicas¹ and see if it helps. If you have no indexes at all, then of course that is your first problem.
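A sketch of that second index, assuming the table and columns are named as in the question's WHERE clause (s, latitude, longitude; the index name is illustrative):

-- complements an existing (latitude, longitude) index; the optimizer can then
-- pick whichever dimension is more selective for a given bounding box
ALTER TABLE s ADD INDEX idx_long_lat (longitude, latitude);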

Now, the actual solution. This requires MySQL 5.7 because before then, the feature works with MyISAM but not with InnoDB. RDS doesn't like it at all if you try to use MyISAM.

This is effectively checking whether the row in question is within the bounding box created by the min / max passed into the WHERE.

What you need is an R-Tree index. These indexes actually store the points (or lines, polygons, etc.) in an order that understands and preserves their proximity in more than one dimension... proximate points are closer in the index and minimum bounding rectangles ("bounding box") are easily and quickly identified.

The MySQL spatial extensions support this type of index.

There's even an MBRContains() function that compares the points in the index to the points in the query, using the R-Tree to find all the points contained in the MBR you're searching. Unlike the usual optimization rule that you should not use column names as function arguments in the WHERE clause (to avoid triggering a table scan), this function is an exception: the optimizer does not actually evaluate the function against every row, but uses the meaning of the expression to evaluate it against the index.
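To make that concrete, here is a sketch of the question's bounding-box query rewritten to use MBRContains(), assuming the s_spatial table and location column sketched a bit further down (illustrative names) and keeping the question's {placeholder} style. Note that the WKT polygon must be closed by repeating its first point, and that this sketch stores longitude as X and latitude as Y:

-- returns rows whose point falls inside the rectangle, via the R-Tree index
SELECT *
FROM s_spatial
WHERE MBRContains(
    ST_GeomFromText(
        'POLYGON(({minimumLongitude} {minimumLatitude},
                  {maximumLongitude} {minimumLatitude},
                  {maximumLongitude} {maximumLatitude},
                  {minimumLongitude} {maximumLatitude},
                  {minimumLongitude} {minimumLatitude}))'),
    location);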

There's a bit of a learning curve to understanding the design of the spatial extensions, but once you understand the principles, it falls into place nicely and the performance will exceed your expectations. You'll want a single column of type GEOMETRY, and you'll want to store lat and long together in that one indexed column as a POINT.

To safely test this without disruption, make a replica, then detach it from your master, promoting it to become its own independent master, and upgrade it to 5.7 if necessary. Create a new table with the same structure plus a GEOMETRY column and a SPATIAL KEY, then populate it with INSERT ... SELECT.
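A minimal sketch of that test setup; the table and column names are illustrative, since the real schema is an assumption here:

-- new table mirroring the original, plus the spatial column and index
CREATE TABLE s_spatial (
  id        INT UNSIGNED NOT NULL PRIMARY KEY,
  latitude  DECIMAL(10,7) NOT NULL,
  longitude DECIMAL(10,7) NOT NULL,
  location  POINT NOT NULL,               -- lat/long stored together as one point
  SPATIAL KEY idx_location (location)     -- the R-Tree index
) ENGINE=InnoDB;                          -- InnoDB spatial indexes need MySQL 5.7+

-- populate it, storing longitude as X and latitude as Y
INSERT INTO s_spatial (id, latitude, longitude, location)
SELECT id, latitude, longitude, POINT(longitude, latitude)
FROM s;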


Note that a DynamoDB scan is a very "expensive" operation. On a table I was testing against just yesterday, a single scan consistently cost 112 read units each time it was run, regardless of the number of records. Presumably that's because a scan always reads up to 1 MB of data, which is 256 blocks of 4 KB (the definition of a read unit), read without strong consistency (so, half the cost): 1 MB ÷ 4 KB ÷ 2 = 128, which I assume is close enough to 112 to explain that number.


¹ It's a valid, supported operation to add an index to a MySQL replica but not the master, even in RDS. You need to temporarily make the replica writable by creating a new parameter group identical to the existing one, and then flipping read_only to 0 in that group. Associate the replica with the new parameter group, wait for the state to change from applying to in-sync, then log in to the replica and add the index. Put the original parameter group back when done.
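Once the replica is writable, the SQL side of that procedure is just the usual statements, for example:

-- on the replica, confirm the new parameter group took effect (should return 0)
SELECT @@global.read_only;
-- then run the ALTER TABLE ... ADD INDEX statement from the aside above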

Upvotes: 1
