Jim Jose

Reputation: 1319

EC2 - MongoDB - PyMongo - Debugging Query Latency

I have two EC2 instances, one running a mongodb server and the other a python web app (same availability zone). The python server connects to the mongo server using PyMongo and everything works fine.

The problem is, when I profile execution time in python, some calls (less than 5%) take up to a couple of seconds to return. I was able to narrow down the problem, and the delay is actually in the db calls to the mongo server.
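(For context, the timings were measured roughly like the sketch below; the host, database, and query are placeholders rather than the real ones:)

    import time
    from pymongo import MongoClient

    # Placeholder connection and collection names, only to illustrate the measurement.
    client = MongoClient("mongo-server-private-ip", 27017)
    coll = client["db"]["UserInfoShared"]

    start = time.time()
    doc = coll.find_one({"_id": "750837091142"})
    elapsed = time.time() - start

    # Most calls finish in 0.01-0.02s; log the rare slow ones.
    if elapsed > 0.1:
        print("slow mongo call: %.3fs" % elapsed)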

The two causes I could think of were:

1. The Mongo server is slow/overloaded
2. Network latency

So I tried upgrading the mongo server to a 4x faster instance, but the issue still happens (some calls take up to 3 secs to return). I assumed that since both servers are on EC2, network latency should not be a problem... but maybe I was wrong.

How can I confirm whether the issue is actually the network itself? If so, what is the best way to solve it? Is there any other possible cause?

Any help is appreciated...

Thanks,

UPDATE: The entities that I am fetching are very small (and indexed), and usually the calls take only 0.01-0.02 secs to finish.

UPDATE:

As suggested by "James Wahlin", I enabled profiling on my mongo server and got some interesting logs:

Fri Mar 15 18:05:22 [conn88635] query db.UserInfoShared query: { $or: [ { _locked: { $exists: false } }, { _locked: { $lte: 1363370603.297361 } } ], _id: "750837091142" } ntoreturn:1 nscanned:1 nreturned:1 reslen:47 2614ms

Fri Mar 15 18:05:22 [conn88635] command db.$cmd command: { findAndModify: "UserInfoShared", fields: { _id: 1 }, upsert: true, query: { $or: [ { _locked: { $exists: false } }, { _locked: { $lte: 1363370603.297361 } } ], _id: "750837091142" }, update: { $set: { _locked: 1363370623.297361 } }, new: true } ntoreturn:1 reslen:153 2614ms

You can see these two calls took more than 2 secs to finish. The _id field is uniquely indexed, and finding it should not have taken this much time. Maybe I have to post a new question for it, but can the mongodb GLOBAL LOCK be the cause?
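(For reference, the profiling above was enabled from PyMongo with something along these lines; the 100 ms slowms threshold is just an example value:)

    from pymongo import MongoClient

    client = MongoClient("mongo-server-private-ip", 27017)  # placeholder host
    db = client["db"]

    # Profile level 1 logs operations slower than the slowms threshold.
    db.command("profile", 1, slowms=100)

    # Slow operations then show up in the mongod log and in db.system.profile:
    for op in db["system.profile"].find().sort("ts", -1).limit(5):
        print(op.get("op"), op.get("ns"), op.get("millis"))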

Upvotes: 1

Views: 1418

Answers (1)

Jim Jose

Reputation: 1319

@James Wahlin, thanks a lot for helping me out.

As it turned out, the main cause of the latency was the mongodb GLOBAL LOCK itself. We had a lock percentage that averaged around 5% and sometimes peaked at 30-50%, and that resulted in the slow queries.

If you are facing this issue, the first thing you have to do is enable the mongodb MMS service (mms.10gen.com), which will give you a lot of insight into what exactly is happening on your db server.

In our case the LOCK PERCENTAGE was really high and there were multiple reasons for it. The first thing you have to do to figure it out is read the mongodb documentation on concurrency: http://docs.mongodb.org/manual/faq/concurrency/

The cause of the lock can be at the application level, in mongodb itself, or in the hardware.

1) Our app was doing a lot of updates, and each update (more than 100 ops/sec) holds the global lock in mongodb. The issue was that when an update happens for an entry which is not in memory, mongo has to load the data into memory first and then update it (in memory), and the whole process happens while holding the global lock. If, say, the whole thing takes 1 sec to complete (0.75 sec to load the data from disk and 0.25 sec to update it in memory), all the other update calls wait for the entire 1 sec, and such updates start queuing up... and you will notice more and more slow requests in your app server.

The solution for it (while it might sound silly) is to query for the same data before you make the update. What that effectively does is move the 'load data into memory' (0.75 sec) part out of the global lock, which greatly reduces your lock percentage.
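(A minimal PyMongo sketch of that pattern, using the collection and fields from the log lines in the question; the query shape and timeout are illustrative, and it uses the pymongo 2.x find_and_modify API that matches the era of this question:)

    import time
    from pymongo import MongoClient

    client = MongoClient("mongo-server-private-ip", 27017)  # placeholder host
    coll = client["db"]["UserInfoShared"]

    now = time.time()
    query = {
        "_id": "750837091142",  # example id taken from the profiler output
        "$or": [{"_locked": {"$exists": False}}, {"_locked": {"$lte": now}}],
    }

    # 1) Read the document first: this pages it into memory outside the
    #    global lock, so the write below only does the cheap in-memory part.
    coll.find_one(query)

    # 2) Then do the actual locking update (same shape as the findAndModify
    #    in the profiler output). Newer PyMongo would use find_one_and_update.
    coll.find_and_modify(
        query=query,
        update={"$set": {"_locked": now + 20}},
        fields={"_id": 1},
        upsert=True,
        new=True,
    )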

2) The other main cause of the global lock is mongodb's data flush to disk. Basically, every 60 sec (or less) mongodb (or the OS) writes the data to disk, and a global lock is held during this process. (This kind of explains the random slow queries.) In your MMS stats, look at the graph for background flush avg... if it's high, that means you need faster disks.
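(If you want the same number without MMS, serverStatus reports it on MMAPv1-era mongod; a quick PyMongo sketch, connection details are placeholders:)

    from pymongo import MongoClient

    client = MongoClient("mongo-server-private-ip", 27017)  # placeholder host
    status = client.admin.command("serverStatus")

    # backgroundFlushing is reported by MMAPv1-era mongod versions.
    flush = status.get("backgroundFlushing", {})
    print("flushes:", flush.get("flushes"))
    print("average_ms:", flush.get("average_ms"))
    print("last_ms:", flush.get("last_ms"))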

In our case, we moved to a new EBS-optimized instance in EC2 and also bumped our provisioned IOPS from 100 to 500, which almost halved the background flush avg, and the servers are much happier now.

Upvotes: 2
