nick

Reputation: 610

Mongo disk read bottleneck for large data sets and IOPS limits

I'm having trouble understanding where the bottlenecks are when reading data from disk in a MongoDB collection. I know indexes are a huge factor in optimizing queries, but let's say we have a collection with no indexes, and I'm running a simple query against a collection of 25 million records totaling around 50 GB:

db.customers.find({ first_name: "xyz" })

Of course, this has to run a COLLSCAN, so it's very slow (unless the data is cached in memory). But how slow it is matters in our case. Running some tests reveals that the machine I run this query on does not peg my available IOPS. On a machine with a max of ~10K read IOPS, this simple query is throttled at around 1.2K, and CPU iowait shows the time is spent waiting on disk.

The query is clearly limited by disk, but it's not utilizing the full potential of what's available on the machine. Interestingly, when I create another database connection and run two queries asynchronously, the IOPS load increases 2x. It seems as though each query can only scan through so much data on disk at a time. What's holding it back when running these queries that don't have indexes?
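One way to see how a single reader can fall short of the device limit: if each random read only begins after the previous one completes, one reader's IOPS ceiling is simply 1 / per-read latency, regardless of what the device can do in aggregate. A rough back-of-envelope model (the 0.8 ms latency figure is an assumption chosen only because it roughly matches the ~1.2K IOPS observed; it is not a measured value):

```python
# Illustrative model: a reader issuing strictly sequential synchronous
# reads is capped at 1/latency IOPS, no matter the device's IOPS limit.

def max_iops_single_reader(read_latency_ms: float) -> float:
    """IOPS ceiling for one thread issuing one read at a time."""
    return 1000.0 / read_latency_ms

# Hypothetical per-read latency of 0.8 ms -> ceiling near 1250 IOPS,
# in the same ballpark as the ~1.2K IOPS observed above.
single = max_iops_single_reader(0.8)

# Two independent readers interleave their waits, roughly doubling the
# aggregate, still far below the ~10K IOPS the device offers.
double = 2 * max_iops_single_reader(0.8)

print(f"one reader: ~{single:.0f} IOPS, two readers: ~{double:.0f} IOPS")
```

Under this model, vertical scaling (a faster disk with more IOPS) doesn't help a single query; only lower per-read latency or more concurrent readers would.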

Longer term, I think coupling an Elasticsearch engine to this will help with complex searches across a lot of diverse data, but I'm really curious why we can't scale vertically in this case.

Upvotes: 3

Views: 3624

Answers (1)

Joe

Reputation: 28356

The mongod nodes use a structure similar to a B-tree to store data. A leaf page can contain many documents, up to 32 KB (compressed), or a single document if that document is 32 KB or larger.

The collection scan is run in the database layer, above the storage engine layer. The database layer requests the next document from storage when it finishes examining the current one.
The storage layer returns the document if it is already in cache. If not, it requests the next page from the operating system. The OS may deliver that page from the filesystem cache or read it from disk. The storage engine then decompresses the page, stores its documents in the cache, and provides the requested document to the database layer.
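The round trip described above can be sketched as a toy model. Everything here (page size, document names, the dict standing in for the disk) is illustrative, not actual MongoDB/WiredTiger internals; the point is that one page-sized I/O on a cache miss serves many subsequent document requests:

```python
import zlib

# Toy model of the layers described above: the database layer asks the
# storage engine for one document at a time; the storage engine serves
# from cache when possible, otherwise "reads" a compressed page from
# "disk", decompresses it, and caches every document on that page.

PAGE_SIZE = 4  # documents per leaf page (illustrative)

# Simulated on-disk pages: each page is one compressed blob of documents.
disk = {
    page_no: zlib.compress(
        ";".join(f"doc{page_no * PAGE_SIZE + i}" for i in range(PAGE_SIZE)).encode()
    )
    for page_no in range(3)
}

cache = {}       # doc_id -> document (the storage engine cache)
disk_reads = 0   # count of simulated page I/Os

def get_document(doc_id):
    """Storage-engine layer: return one document, reading a page on a miss."""
    global disk_reads
    if doc_id in cache:
        return cache[doc_id]
    page_no = doc_id // PAGE_SIZE
    disk_reads += 1  # cache miss: one I/O fetches the whole page
    for i, doc in enumerate(zlib.decompress(disk[page_no]).decode().split(";")):
        cache[page_no * PAGE_SIZE + i] = doc  # cache every doc on the page
    return cache[doc_id]

# Database layer: the collection scan examines documents one at a time.
matches = [get_document(i) for i in range(12) if get_document(i) == "doc7"]
print(matches, disk_reads)  # 12 documents examined, only 3 page reads
```

Note the strictly one-at-a-time request pattern in the last line: the next document is not fetched until the current one has been examined, which is why a single scan cannot keep many disk requests in flight.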

The thread that checks documents against the query alternates with a storage engine thread. When you run 2 queries at the same time, a second thread processes the second query, so the two can interleave their requests to the disk, resulting in higher I/O usage.

Upvotes: 6
