Timur K

Reputation: 426

MongoDB performance issues in AWS

We have the following MongoDB setup:

Infrastructure

A 3-node replica set running in AWS. At this point, the nodes are all in the same availability zone and are all i3.large instances. Two of the nodes host the DB data on local NVMe drives, and one hosts it on an EBS volume with Provisioned IOPS.

Data

The data setup is a bit dubious, but it should work fine according to my understanding of the documentation.

We have a database per customer - about 55 thousand of them.

Each database contains a few collections with account-specific data. There is nothing particularly fancy about the data, but some collections do have indices in addition to the default index on _id.
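For illustration only, here is a minimal sketch of what that layout looks like through the C# driver; the connection handling is omitted, and the customer_<id> database naming plus the events/timestamp names are placeholders, not our actual ones:

using MongoDB.Bson;
using MongoDB.Driver;

// Hypothetical sketch of the database-per-customer layout.
// All names here (customer_<id>, events, timestamp) are placeholders.
class PerCustomerLayout
{
    public static IMongoCollection<BsonDocument> EventsFor(MongoClient client, string customerId)
    {
        // One database per customer, each with its own small set of collections.
        IMongoDatabase customerDb = client.GetDatabase($"customer_{customerId}");
        IMongoCollection<BsonDocument> events = customerDb.GetCollection<BsonDocument>("events");

        // Some collections carry an index or two beyond the default index on _id.
        events.Indexes.CreateOne(
            new CreateIndexModel<BsonDocument>(
                Builders<BsonDocument>.IndexKeys.Ascending("timestamp")));

        return events;
    }
}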

The Writes

Most of the data being written consists of events associated with customer accounts. These are collected into an SQS queue and are currently written by a single thread. The writer buffers about 15 minutes' worth of data, determines which database each batch belongs to, and flushes it. The process is a C# Windows service.

Some write operations are considered high priority and bypass the queue; these include account creations and other high-priority events.
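Roughly, the writer behaves like the sketch below. This is a simplification that reuses the placeholder customer_<id>/events naming from above; the SQS consumption and Windows-service plumbing are omitted:

using System.Collections.Generic;
using MongoDB.Bson;
using MongoDB.Driver;

// Simplified sketch of the buffering writer. The real service consumes from SQS
// and runs as a Windows service; only the buffer-and-flush idea is shown here.
class BufferedEventWriter
{
    private readonly MongoClient _client;
    private readonly Dictionary<string, List<BsonDocument>> _buffer =
        new Dictionary<string, List<BsonDocument>>();

    public BufferedEventWriter(MongoClient client) => _client = client;

    // Accumulate events per customer for roughly 15 minutes.
    public void Enqueue(string customerId, BsonDocument evt)
    {
        if (!_buffer.TryGetValue(customerId, out var list))
            _buffer[customerId] = list = new List<BsonDocument>();
        list.Add(evt);
    }

    // Flush each customer's batch into that customer's own database.
    public void Flush()
    {
        foreach (var kvp in _buffer)
        {
            var events = _client
                .GetDatabase($"customer_{kvp.Key}")
                .GetCollection<BsonDocument>("events");
            events.InsertMany(kvp.Value);
        }
        _buffer.Clear();
    }
}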

The Problem

For the most part, this setup works fine. The problem comes when we need to perform an operation on all customer accounts, e.g. delete events that are older than X, or add certain data to every account. In both of these scenarios, the process walks a list of account IDs and performs the operation against each account's database.
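In essence, the maintenance jobs boil down to a loop like the following sketch, again using the placeholder names from above; the real jobs also batch and log as they go:

using System;
using System.Collections.Generic;
using MongoDB.Bson;
using MongoDB.Driver;

// Simplified sketch of the "operate on every account" jobs.
class AccountMaintenance
{
    // Scenario 1: delete events older than some cutoff in every account database.
    public static void PurgeOldEvents(MongoClient client,
                                      IEnumerable<string> accountIds,
                                      DateTime cutoff)
    {
        foreach (var id in accountIds)
        {
            var events = client
                .GetDatabase($"customer_{id}")
                .GetCollection<BsonDocument>("events");
            events.DeleteMany(Builders<BsonDocument>.Filter.Lt("timestamp", cutoff));
        }
    }

    // Scenario 2: add a field to the documents of some collection in every database.
    public static void AddSetting(MongoClient client, IEnumerable<string> accountIds)
    {
        foreach (var id in accountIds)
        {
            var settings = client
                .GetDatabase($"customer_{id}")
                .GetCollection<BsonDocument>("settings");
            settings.UpdateMany(
                Builders<BsonDocument>.Filter.Empty,
                Builders<BsonDocument>.Update.Set("newFlag", true));
        }
    }
}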

Both scenarios show roughly the same problem manifestation. The process starts up fast and goes through many accounts quickly. As it gets past ~20k accounts, it starts slowing down. The slowdowns get longer and longer and start affecting DB reads. On the last run, it became unresponsive after ~41k accounts had been processed (and had already caused read failures by that point).

The DB itself does respond. From the terminal, I am able to run rs.status() and rs.printSlaveReplicationInfo(), which shows a growing gap between the PRIMARY and the SECONDARIES.

Connecting to the database from a remote client gets stuck on retrieving the replica set.

There is nothing in the logs that stands out on either the PRIMARY or the SECONDARIES. A snippet of the log from the PRIMARY is below.

Any thoughts or ideas?

Thanks!


2018-12-18T18:46:43.238+0000 I NETWORK [conn39304] received client metadata from 172.30.1.180:52756 conn39304: { driver: { name: "NetworkInterfaceASIO-RS", version: "3.6.8" }, os: { type: "Linux", name: "PRETTY_NAME="Debian GNU/Linux 9 (stretch)"", architecture: "x86_64", version: "Kernel 4.9.0-8-amd64" } }
2018-12-18T18:46:44.059+0000 I COMMAND [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after dur: 0, after extra_info: 0, after globalLock: 0, after locks: 0, after logicalSessionRecordCache: 0, after network: 0, after opLatencies: 0, after opcounters: 0, after opcountersRepl: 0, after repl: 0, after security: 0, after storageEngine: 0, after tcmalloc: 0, after transactions: 0, after transportSecurity: 0, after wiredTiger: 1058, at end: 1058 }
2018-12-18T18:46:44.498+0000 I NETWORK [conn39305] received client metadata from 172.30.1.193:58142 conn39305: { driver: { name: "NetworkInterfaceASIO-RS", version: "3.6.7" }, os: { type: "Linux", name: "PRETTY_NAME="Debian GNU/Linux 9 (stretch)"", architecture: "x86_64", version: "Kernel 4.9.0-8-amd64" } }
2018-12-18T18:46:44.500+0000 I ACCESS [conn39305] Successfully authenticated as principal __system on local
2018-12-18T18:46:44.540+0000 I ACCESS [conn39304] Successfully authenticated as principal __system on local
2018-12-18T18:46:45.758+0000 I COMMAND [PeriodicTaskRunner] task: UnusedLockCleaner took: 243ms
2018-12-18T18:47:22.360+0000 I NETWORK [listener] connection accepted from 172.30.1.180:52758 #39306 (245 connections now open)
2018-12-18T18:47:22.399+0000 I NETWORK [conn39306] received client metadata from 172.30.1.180:52758 conn39306: { driver: { name: "NetworkInterfaceASIO-RS", version: "3.6.8" }, os: { type: "Linux", name: "PRETTY_NAME="Debian GNU/Linux 9 (stretch)"", architecture: "x86_64", version: "Kernel 4.9.0-8-amd64" } }
2018-12-18T18:47:22.401+0000 I ACCESS [conn39306] Successfully authenticated as principal __system on local
2018-12-18T18:47:22.465+0000 I NETWORK [listener] connection accepted from 172.30.1.193:58144 #39307 (246 connections now open)
2018-12-18T18:47:22.539+0000 I NETWORK [conn39307] received client metadata from 172.30.1.193:58144 conn39307: { driver: { name: "NetworkInterfaceASIO-RS", version: "3.6.7" }, os: { type: "Linux", name: "PRETTY_NAME="Debian GNU/Linux 9 (stretch)"", architecture: "x86_64", version: "Kernel 4.9.0-8-amd64" } }
2018-12-18T18:47:22.579+0000 I ACCESS [conn39307] Successfully authenticated as principal __system on local
2018-12-18T18:47:35.372+0000 I ACCESS [conn137] Successfully authenticated as principal __system on local
2018-12-18T18:47:35.374+0000 I ACCESS [conn137] Successfully authenticated as principal __system on local
2018-12-18T18:47:35.377+0000 I ACCESS [conn137] Successfully authenticated as principal __system on local
2018-12-18T18:47:35.554+0000 I ACCESS [conn137] Successfully authenticated as principal __system on local
2018-12-18T18:47:37.797+0000 I ACCESS [conn137] Successfully authenticated as principal __system on local
2018-12-18T18:47:46.685+0000 I NETWORK [listener] connection accepted from 172.30.1.187:33484 #39308 (247 connections now open)
2018-12-18T18:47:46.699+0000 I NETWORK [conn39308] received client metadata from 172.30.1.187:33484 conn39308: { driver: { name: "mongo-csharp-driver", version: "0.0.0.0" }, os: { type: "Windows", name: "Microsoft Windows NT 6.2.9200.0", architecture: "x86_64", version: "6.2.9200.0" }, platform: ".NET Framework 4.5" }
2018-12-18T18:47:46.770+0000 I ACCESS [conn39308] Successfully authenticated as principal tdservice on admin
2018-12-18T18:48:02.362+0000 I NETWORK [listener] connection accepted from 172.30.1.180:52760 #39309 (248 connections now open)
2018-12-18T18:48:02.419+0000 I NETWORK [conn39309] received client metadata from 172.30.1.180:52760 conn39309: { driver: { name: "NetworkInterfaceASIO-RS", version: "3.6.8" }, os: { type: "Linux", name: "PRETTY_NAME="Debian GNU/Linux 9 (stretch)"", architecture: "x86_64", version: "Kernel 4.9.0-8-amd64" } }
2018-12-18T18:48:02.421+0000 I ACCESS [conn39309] Successfully authenticated as principal __system on local
2018-12-18T18:48:02.470+0000 I NETWORK [listener] connection accepted from 172.30.1.193:58146 #39310 (249 connections now open)
2018-12-18T18:48:02.489+0000 I NETWORK [conn39310] received client metadata from 172.30.1.193:58146 conn39310: { driver: { name: "NetworkInterfaceASIO-RS", version: "3.6.7" }, os: { type: "Linux", name: "PRETTY_NAME="Debian GNU/Linux 9 (stretch)"", architecture: "x86_64", version: "Kernel 4.9.0-8-amd64" } }
2018-12-18T18:48:02.510+0000 I ACCESS [conn39310] Successfully authenticated as principal __system on local

Upvotes: 0

Views: 337

Answers (1)

Timur K

Reputation: 426

So, just an update on this issue. The dubious data setup (a database per customer) turned out to be highly correlated with the problem. When we restructured the data into a single database and extended the collections so that each document explicitly identifies the customer, all of the problems went away. It seems that the per-database overhead is prohibitive for an architecture with this many databases.
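For anyone curious, the restructured layout looks roughly like the sketch below. The database, collection, and field names (appdata, events, customerId, timestamp) are placeholders; the key points are a single database, a customer identifier on every document, and a compound index that leads with it:

using System;
using MongoDB.Bson;
using MongoDB.Driver;

// Sketch of the restructured layout: one database, customer id on each document.
class SingleDatabaseLayout
{
    public static IMongoCollection<BsonDocument> EventsCollection(MongoClient client)
    {
        var events = client.GetDatabase("appdata")
                           .GetCollection<BsonDocument>("events");

        // Compound index leading with the customer id, so per-customer queries
        // and the bulk maintenance passes stay indexed.
        events.Indexes.CreateOne(
            new CreateIndexModel<BsonDocument>(
                Builders<BsonDocument>.IndexKeys
                    .Ascending("customerId")
                    .Ascending("timestamp")));
        return events;
    }

    // The old "delete events older than X for every account" pass becomes a
    // DeleteMany per customer (or a single call if the customer filter is dropped).
    public static void PurgeOldEvents(IMongoCollection<BsonDocument> events,
                                      string customerId, DateTime cutoff)
    {
        var filter = Builders<BsonDocument>.Filter.And(
            Builders<BsonDocument>.Filter.Eq("customerId", customerId),
            Builders<BsonDocument>.Filter.Lt("timestamp", cutoff));
        events.DeleteMany(filter);
    }
}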

Upvotes: 0
