Reputation: 728
My database is very CPU constrained, and I can't find the root cause of the issue. I currently have two application servers, each with a Rails API connecting to PostgreSQL via the ruby-pg gem. Both application servers also run Sidekiq for background jobs, and I have a handful of support servers processing new posts from a national feed via Sidekiq. If I were running out of memory, the solution would seemingly be straightforward. Any general ideas why I am CPU constrained?
Database Specs:
Possible Problems:
The database has nearly 10GB of indexes. I am going to upgrade my database to PostgreSQL version >= 9.2, since index-only scans were introduced in 9.2.
In postgresql.conf, I have set max_connections to 500. Usually throughout the day only about 175 connections are in use, but during peak times Sidekiq tasks push the count up to 350. How many connections are recommended for an 8GB server instance?
When I take a look at pg_stat_activity in the psql console, I see Sidekiq leaving a lot of IDLE connections. Could these connections result in CPU inflation? Does the fix belong in the API or in Sidekiq?
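For reference, a minimal way to get the breakdown (the state column is PostgreSQL 9.2+; on 9.1, idle sessions instead show up as '<IDLE>' in the current_query column):

    -- Count connections grouped by state (9.2+)
    SELECT state, count(*)
    FROM pg_stat_activity
    GROUP BY state
    ORDER BY count(*) DESC;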
Maybe there is no bug and I simply need a bigger server instance, though again that would make more sense if I were memory bound. Both app servers and 3 of the support Sidekiq servers are 4GB performance-tier instances, so the servers that interact with the database have, combined, more than double the resources of the database itself. Should this even matter?
Additional questions:
Upvotes: 3
Views: 3977
Reputation: 324711
You are using massively too many concurrent connections. PostgreSQL will be wasting lots of its time on housekeeping and juggling concurrent queries. All the concurrent work will be fighting for CPU and buffer space, there'll be heavy contention on spinlocks, and it'll all generally be a mess.
On an 8 core machine, you should probably not have more than 20 actively working connections if you're mostly CPU constrained. If you're I/O limited, you can go higher, but 350 is just ridiculous.
If possible, put a PgBouncer in transaction pooling mode in front of your PostgreSQL instance, so queries get queued up and executed rapidly in series instead of slowly in parallel.
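A minimal sketch of what that can look like in pgbouncer.ini (the database name and pool sizes here are illustrative, not tuned values):

    [databases]
    ; "myapp" is a placeholder database name
    myapp = host=127.0.0.1 port=5432 dbname=myapp

    [pgbouncer]
    listen_port = 6432
    pool_mode = transaction
    ; keep the server-side pool small; excess clients queue instead
    default_pool_size = 20
    max_client_conn = 500

The Rails apps and Sidekiq then connect to port 6432 instead of 5432, and PgBouncer hands each transaction to one of the 20 real backend connections.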
See number of database connections (Pg wiki).
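That page gives a rough starting formula for sizing the active connection pool:

    connections = ((core_count * 2) + effective_spindle_count)

so an 8-core box with a few disks lands somewhere around 16-20, consistent with the figure above.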
Additionally, PostGIS can be very CPU-heavy; it sometimes needs to do very complex calculations. I suggest using the auto_explain module to record long-running queries, and using pg_stat_statements / pg_stat_plans to record what's taking up resources. Examine these queries to see if they need improvement.
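For example, both can be switched on in postgresql.conf along these lines (a sketch; the threshold is illustrative, and changing shared_preload_libraries requires a restart):

    # load the modules at server start
    shared_preload_libraries = 'auto_explain,pg_stat_statements'
    # on 9.1 you also need: custom_variable_classes = 'auto_explain'
    # log the plan of anything slower than 500ms (illustrative threshold)
    auto_explain.log_min_duration = '500ms'

then run CREATE EXTENSION pg_stat_statements; in the database you want to track.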
Your idle in transaction sessions must be dealt with, too. Depending on why they're idle and whether they have a transaction ID or not, they might be causing serious table bloat. They also create unnecessary signalling overhead within PostgreSQL, as it has to do more coordination with backends that are actively doing things. Finally, the number of open transactions itself increases the cost of some internal housekeeping operations.
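A sketch of how to find them (9.2+ column names; on 9.1 they appear as '<IDLE> in transaction' in current_query, and pid is called procpid):

    -- Sessions sitting idle inside an open transaction, oldest first
    SELECT pid, usename, xact_start, query
    FROM pg_stat_activity
    WHERE state = 'idle in transaction'
    ORDER BY xact_start;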
So. Your DB will probably perform better if you reduce the connection counts, put a PgBouncer in transaction pooling mode in front, and fix those idle connections.
Upvotes: 14
Reputation: 44285
Most likely you are CPU constrained because your work needs a lot of CPU. :)
9.1 is not generally bad at indexes. There may be specific issues, as with any version, and exactly which ones they are can change from version to version.
Index-only scans are mostly a benefit when you are IO constrained. I wouldn't hold out much hope for them being a magic bullet for you.
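If you want to check whether a particular query would even qualify once you're on 9.2+, EXPLAIN will tell you; the table and column here are hypothetical:

    -- Look for "Index Only Scan" in the output; it only kicks in when
    -- the visibility map is current, i.e. the table is well vacuumed
    EXPLAIN SELECT post_id FROM posts WHERE post_id < 1000;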
350 connections are certainly not helpful, but they are probably not very harmful either. When they are harmful, though, it can be downright catastrophic. The correct value is determined more by the number of cores than by the amount of RAM. If it is easy to throttle down the Sidekiq connections, do it even if you can't prove that it helps.
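That throttling is normally a client-side config change. A sketch assuming stock Sidekiq and Rails configuration files (the numbers are illustrative, not recommendations):

    # config/sidekiq.yml -- fewer worker threads means fewer DB connections
    :concurrency: 10

    # config/database.yml (fragment) -- cap ActiveRecord's pool to match
    production:
      pool: 10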
If the connections are just IDLE, not IDLE in transaction, then they probably aren't very harmful, but again there are a few cases where they can be. That is pretty much the same issue as the number of connections.
The connection you showed from top was idle in transaction. That status shouldn't be taking up much CPU, so it probably means the session is rapidly cycling through statements and top just happened to catch it between them. But you didn't say how many similar lines there were in top; if it is just that one, it suggests your code is not running concurrently and 7 of your 8 CPUs are wasted.
Regarding the db server versus the other servers, if the database is fundamentally the limit, beating on it with a bigger hammer is not going to help. Often there is some flexibility about where computation is done. If you can get the app servers to do more of the computation that is currently done on the db, and let the db focus on ACID issues, that would be good. But no one but you can know if that is possible or feasible.
My first stop would be to use pg_stat_statements to see what SQL statements are taking the most time. Maybe just adding an index to the slowest/most frequent query would make the problem magically go away.
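A typical first query against that view (column names as in the 9.x versions of the extension; total_time is in milliseconds):

    -- Top 10 statements by cumulative execution time
    SELECT calls, total_time, query
    FROM pg_stat_statements
    ORDER BY total_time DESC
    LIMIT 10;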
Upvotes: 2