user2870968

Reputation: 553

RDS PostgreSQL DB slow & timeouts during daily 2-hour CPU peaks

For over two weeks now, my RDS instance (PostgreSQL 10.6 on a db.t3.small) has shown daily 2-hour CPU peaks during business hours, accompanied by increased read and write latency, which lead to poor responsiveness or timeouts in my application.

I've investigated (see below), and at this point I'm fairly convinced these peaks affecting my users aren't caused by my own workload; I suspect they come either from RDS admin/background tasks or from some PostgreSQL issue.

Has anyone run into and resolved a similar issue with PostgreSQL? Can someone help me investigate the RDS admin-task side, or point me to other avenues to get to the bottom of this?
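
For reference, here is a minimal sketch of how I could snapshot pg_stat_activity in Python while a peak is in progress, to see whether the busy backends are client queries, autovacuum workers, or other background processes (the connection parameters are placeholders):

# Snapshot pg_stat_activity during a CPU peak to see which backends are busy.
import psycopg2

conn = psycopg2.connect(
    host="my-instance.xxxxxxxxxx.eu-west-1.rds.amazonaws.com",  # placeholder endpoint
    dbname="mydb",          # placeholder
    user="myuser",          # placeholder
    password="mypassword",  # placeholder
)

with conn, conn.cursor() as cur:
    # backend_type (available since PostgreSQL 10) separates client backends
    # from autovacuum workers, the checkpointer, the background writer, etc.
    cur.execute("""
        SELECT pid, backend_type, state, wait_event_type, wait_event,
               now() - query_start AS running_for,
               left(query, 80)     AS query
        FROM pg_stat_activity
        WHERE state IS DISTINCT FROM 'idle'
        ORDER BY query_start
    """)
    for row in cur.fetchall():
        print(row)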

What I observe:

What I've investigated:

Here are the basic logs around a peak start (captured before I enabled statement logging; see also the pg_stat_statements sketch after the logs):

2019-12-09 15:04:05 UTC::@:[4221]:LOG:  checkpoint starting: time
2019-12-09 15:04:05 UTC::@:[4221]:LOG:  checkpoint complete: wrote 2 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.202 s, sync=0.001 s, total=0.213 s; sync files=2, longest=0.001 s, average=0.000 s; distance=16369 kB, estimate=16395 kB
2019-12-09 15:09:05 UTC::@:[4221]:LOG:  checkpoint starting: time
2019-12-09 15:09:05 UTC::@:[4221]:LOG:  checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.101 s, sync=0.001 s, total=0.112 s; sync files=1, longest=0.001 s, average=0.001 s; distance=16384 kB, estimate=16394 kB
2019-12-09 15:14:05 UTC::@:[4221]:LOG:  checkpoint starting: time
2019-12-09 15:14:05 UTC::@:[4221]:LOG:  checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.101 s, sync=0.002 s, total=0.113 s; sync files=1, longest=0.002 s, average=0.002 s; distance=16384 kB, estimate=16393 kB
2019-12-09 15:19:06 UTC::@:[4221]:LOG:  checkpoint starting: time
2019-12-09 15:19:06 UTC::@:[4221]:LOG:  checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.101 s, sync=0.001 s, total=0.113 s; sync files=1, longest=0.001 s, average=0.001 s; distance=16384 kB, estimate=16392 kB

[CPU PEAK STARTS here that day, at 16:20 UTC+1]

2019-12-09 15:24:06 UTC::@:[4221]:LOG:  checkpoint starting: time
2019-12-09 15:24:06 UTC::@:[4221]:LOG:  checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.101 s, sync=0.002 s, total=0.114 s; sync files=1, longest=0.002 s, average=0.002 s; distance=16384 kB, estimate=16391 kB
2019-12-09 15:29:06 UTC::@:[4221]:LOG:  checkpoint starting: time
2019-12-09 15:29:06 UTC::@:[4221]:LOG:  checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.101 s, sync=0.002 s, total=0.113 s; sync files=1, longest=0.001 s, average=0.001 s; distance=16384 kB, estimate=16390 kB
2019-12-09 15:34:06 UTC::@:[4221]:LOG:  checkpoint starting: time
2019-12-09 15:34:06 UTC::@:[4221]:LOG:  checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.103 s, sync=0.002 s, total=0.118 s; sync files=1, longest=0.002 s, average=0.002 s; distance=16384 kB, estimate=16390 kB
2019-12-09 15:39:06 UTC::@:[4221]:LOG:  checkpoint starting: time
2019-12-09 15:39:06 UTC::@:[4221]:LOG:  checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.104 s, sync=0.003 s, total=0.127 s; sync files=1, longest=0.002 s, average=0.002 s; distance=16384 kB, estimate=16389 kB
2019-12-09 15:44:06 UTC::@:[4221]:LOG:  checkpoint starting: time
2019-12-09 15:44:06 UTC::@:[4221]:LOG:  checkpoint complete: wrote 2 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.219 s, sync=0.010 s, total=0.303 s; sync files=2, longest=0.010 s, average=0.005 s; distance=16392 kB, estimate=16392 kB
2019-12-09 15:49:07 UTC::@:[4221]:LOG:  checkpoint starting: time
2019-12-09 15:49:09 UTC::@:[4221]:LOG:  checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.318 s, sync=0.516 s, total=2.426 s; sync files=1, longest=0.516 s, average=0.516 s; distance=16375 kB, estimate=16390 kB
2019-12-09 15:54:07 UTC::@:[4221]:LOG:  checkpoint starting: time
2019-12-09 15:54:09 UTC::@:[4221]:LOG:  checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.367 s, sync=1.230 s, total=2.043 s; sync files=1, longest=1.230 s, average=1.230 s; distance=16384 kB, estimate=16389 kB
2019-12-09 15:59:07 UTC::@:[4221]:LOG:  checkpoint starting: time
2019-12-09 15:59:08 UTC::@:[4221]:LOG:  checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.139 s, sync=0.195 s, total=1.124 s; sync files=1, longest=0.195 s, average=0.195 s; distance=16383 kB, estimate=16389 kB
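
To complement statement logging once it is enabled, here is a minimal sketch of pulling the top statements by cumulative execution time from pg_stat_statements, assuming the extension is loaded through the RDS parameter group (shared_preload_libraries) and created in the database (connection parameters are placeholders):

# Top statements by cumulative execution time since the last stats reset.
import psycopg2

conn = psycopg2.connect(
    host="my-instance.xxxxxxxxxx.eu-west-1.rds.amazonaws.com",  # placeholder endpoint
    dbname="mydb",          # placeholder
    user="myuser",          # placeholder
    password="mypassword",  # placeholder
)

with conn, conn.cursor() as cur:
    # In PostgreSQL 10 the timing columns are total_time / mean_time (milliseconds).
    cur.execute("""
        SELECT calls,
               round(total_time::numeric, 1) AS total_ms,
               round(mean_time::numeric, 2)  AS mean_ms,
               left(query, 80)               AS query
        FROM pg_stat_statements
        ORDER BY total_time DESC
        LIMIT 10
    """)
    for row in cur.fetchall():
        print(row)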

Attached screenshots: CPU around one peak, CPU over a week, read latency around a peak, write latency around a peak, Performance Insights around the Dec 10 peak, Performance Insights around the Dec 9 peak.

Upvotes: 3

Views: 1547

Answers (1)

Stijn De Haes

Reputation: 1

It could be that you are running out of burst credits on your disk because of a PostgreSQL background process. If I remember correctly, all RDS disks are of the gp2 type, meaning you have a certain base IOPS plus credits you can spend to go above that for a short period of time. You should be able to see this effect in the queue depth metric on the monitoring page: if this is happening, you should see a peak in that number. The easiest solution is to simply increase the disk size.
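
As a rough sketch, you could pull the relevant CloudWatch metrics around a peak window with boto3 (the instance identifier, region, and time window below are placeholders). BurstBalance tracks the remaining gp2 I/O credits, DiskQueueDepth is the queue depth metric I mentioned, and CPUCreditBalance covers the t3 instance's CPU credits:

# Pull RDS storage/CPU burst metrics around a peak window from CloudWatch.
# Assumes boto3 is installed and AWS credentials are configured.
from datetime import datetime, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")  # placeholder region

def rds_metric(name, start, end, period=300):
    # Average of one AWS/RDS metric over 5-minute buckets for the given window.
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=name,
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-instance"}],  # placeholder
        StartTime=start,
        EndTime=end,
        Period=period,
        Statistics=["Average"],
    )
    return sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])

# Window around the Dec 9 peak (times in UTC; adjust to your own peak).
start = datetime(2019, 12, 9, 14, 0, tzinfo=timezone.utc)
end = datetime(2019, 12, 9, 19, 0, tzinfo=timezone.utc)

# BurstBalance: remaining gp2 I/O credits (%).
# DiskQueueDepth: outstanding I/O requests (the queue depth metric above).
# CPUCreditBalance: remaining CPU credits for burstable (t3) instance classes.
for name in ("BurstBalance", "DiskQueueDepth", "CPUCreditBalance"):
    for point in rds_metric(name, start, end):
        print(name, point["Timestamp"], round(point["Average"], 2))

If BurstBalance drops toward zero at the same time the latency climbs, the burst-credit explanation fits; if it stays flat, the cause is likely elsewhere.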

Upvotes: 0
