joshm1

Reputation: 553

Upgraded RDS postgres from 9.4 to 9.5, CPU has been stuck at 100% for hours

After upgrading my RDS postgres from 9.4 to 9.5, I'm seeing the CPU stay around 100% for over 8 hours now.

I'm seeing the same database queries that used to take < 1 second run for 10+ minutes before I manually cancel them.

I'm not dealing with a large database. Most of the tables being queried have < 10000 rows.

CPU usage

My read IOPS and write IOPS are very low compared to normal (mostly because the sites are down and I shut down non-critical services).

I've been watching pg_stat_activity for active queries and don't see anything unusual (except for the long-running queries that used to take < 1 second).
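For reference, a check of this kind can be sketched as the query below. It is a minimal example rather than the exact query used here; it assumes only the standard `pg_stat_activity` columns (`pid`, `state`, `query_start`, `query`) available in 9.4 through 9.6, and the 80-character truncation is an arbitrary choice for readability.

```sql
-- List non-idle sessions, longest-running statement first.
SELECT pid,
       now() - query_start AS runtime,  -- how long the current statement has been running
       state,
       left(query, 80) AS query_text    -- truncated for readability
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC NULLS LAST;
```

A statement that is stuck can then be cancelled with `SELECT pg_cancel_backend(pid);` using the `pid` from the output.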

I did upgrade from 9.5 to 9.6 just for the hell of it and it didn't help.

Any suggestions for debugging this? I'm stumped and many sites are down.

Upvotes: 4

Views: 1791

Answers (2)

tannart

Reputation: 315

ANALYZE VERBOSE;

I experienced an extremely similar issue, right down to the versions of postgres being moved from and to, and was able to solve it just by running ANALYZE.

The issue is that a major version upgrade does not carry over the planner statistics, so until they are rebuilt postgres chooses its query plans without them and the plans can be very poor. When you do an RDS upgrade it does not implicitly regenerate these statistics; this needs to be done manually (I'm sure there's a reason why AWS doesn't do this automatically, but I've really no idea what it is).

In my case I saw roughly a week of extremely high CPU usage, just as in your case; then, after running ANALYZE, my CPU dropped back to its previous baseline. As you can see in the image below, the upgrade (in my case from 9.4 to 9.5) was run on 11/27 and the ANALYZE query was run on 12/02.

(The VERBOSE is not strictly necessary, but it's useful to be able to watch the progress of the command.)
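One way to check whether tables have been analyzed since the upgrade is to look at the standard `pg_stat_user_tables` view. This is a rough sketch, not something from my original debugging; it assumes that a NULL in both timestamp columns means no ANALYZE has been recorded for that table since the statistics were last reset.

```sql
-- Tables with NULL in both columns have no recorded ANALYZE.
SELECT relname,
       n_live_tup,        -- estimated live row count
       last_analyze,      -- last manual ANALYZE
       last_autoanalyze   -- last ANALYZE run by autovacuum
FROM pg_stat_user_tables
ORDER BY relname;
```

After running `ANALYZE VERBOSE;`, the `last_analyze` column should be populated for every table in the current database.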

1 week of postgres cpu usage.

Upvotes: 9

Ben McNiel

Reputation: 8801

Debugging RDS is hard since you can't inspect the host OS. If possible, you could take a snapshot of the DB and then create two brand-new RDS instances from the snapshot, one on version 9.5 and one on 9.6. This would help you understand whether this is a problem with:

  1. The upgrade process (if the new instances work)
  2. Your application running on 9.6 (if high cpu on 9.6 but the 9.5 instance works)
  3. Something else.

Upvotes: 0
