NicolasZ

Reputation: 983

PostgreSQL error "invalid DSA memory alloc request size" after migrating to PG 17 (configuration)

Two weeks after migrating to Postgres 17.2, we started getting an error on a query that had worked flawlessly for years on Postgres 14. I suspected this could be related to the database configuration parameters or to the major version change, but as far as we know we are using the default values in both versions. We are running a Postgres RDS instance with 64 GB of RAM, and our data flow joins tables with tens of millions of records. We managed to isolate the issue to a query that does 2 regular outer joins on indexed columns.

SELECT *
FROM company
LEFT OUTER JOIN company_s
  ON company.domain = company_s.domain
LEFT OUTER JOIN socials
  ON company_s.raw_domain = socials.domain

This query normally returns in under 4 minutes. In this case it runs for 5 to 11 minutes and then fails with the error:

invalid DSA memory alloc request size 1811939328

It is quite odd that the flow ran without issue for the first 2 weeks and then stopped working on the same data.

This shows up in both our staging and production environments, which are identical. In both cases the instance abnormally uses all of its memory (usually more than 10 GB are free while the flows run). In staging the query tends to fail after 11 minutes, sometimes accompanied by SSL SYSCALL Error: EOF detected. In production it fails after 5 minutes with: invalid DSA memory alloc request size 1811939328
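For scale, the byte count in the error message can be converted to a readable size directly in a psql session. This is only a sketch: the comparison against 1 GB reflects PostgreSQL's general ceiling on ordinary (non-huge) memory allocations, and treating that ceiling as the reason this particular request is rejected is an assumption, not something confirmed for this query.

```sql
-- Convert the byte count from the error message into a readable size.
SELECT pg_size_pretty(1811939328::bigint) AS requested_size;

-- Compare it with 1 GB (2^30 bytes), roughly the ceiling for ordinary allocations.
SELECT 1811939328::bigint > 1024::bigint * 1024 * 1024 AS exceeds_1gb;
```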

What we tried

  1. Trimming oversized entries
  2. Rebuilding all indexes
  3. Running ANALYZE on all the involved tables
  4. Running on a reduced sample of only 1 million entries, which works but is not a real solution to our problem
  5. Increasing shared_buffers to 32GB
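For reference, steps 2, 3 and 5 can be sketched roughly as follows. The table names are taken from the query above; the exact commands that were run are not recorded here, so treat this as an illustration rather than a transcript.

```sql
-- Step 2: rebuild every index on the three tables used by the query.
REINDEX TABLE company;
REINDEX TABLE company_s;
REINDEX TABLE socials;

-- Step 3: refresh the planner statistics for the same tables.
ANALYZE company;
ANALYZE company_s;
ANALYZE socials;

-- Step 5: confirm which shared_buffers value is actually in effect.
SHOW shared_buffers;
```

Note that plain REINDEX TABLE blocks writes to the table while it runs, so on a busy instance it is better scheduled outside the data flow window.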

There is a bug report for PG 17 with the same error, but it does not include a solution: https://www.postgresql.org/message-id/18349-83d33dd3d0c855c3%40postgresql.org

There are a few questions about the same problem, mostly unanswered or with answers that do not apply to our case.

Upvotes: 0

Views: 307

Answers (1)

NicolasZ

Reputation: 983

Following the lead of the possible bug report on PG17, we ran the queries with

SET max_parallel_workers_per_gather = 0;

and that made the query return in 5 minutes without errors. Digging deeper, we reviewed work_mem, which RDS had set to 4MB by default. We raised it to 64MB in the configuration, based loosely on this parameter guide, and the query started working smoothly, returning in around 3 minutes.

SET work_mem TO '64MB';
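If you want to try the same settings without changing them for the whole instance, one option is to scope them to a single transaction with SET LOCAL. This is a sketch only: the role name flow_user is hypothetical, and whether you keep parallelism disabled alongside the higher work_mem is up to you (we did not test every combination).

```sql
-- Inspect the values currently in effect for this session.
SHOW work_mem;
SHOW max_parallel_workers_per_gather;

BEGIN;
-- SET LOCAL only lasts until the end of this transaction.
SET LOCAL work_mem = '64MB';
-- Optional: uncomment to also disable parallel workers for this transaction.
-- SET LOCAL max_parallel_workers_per_gather = 0;

SELECT *
FROM company
LEFT OUTER JOIN company_s
  ON company.domain = company_s.domain
LEFT OUTER JOIN socials
  ON company_s.raw_domain = socials.domain;
COMMIT;

-- To persist the setting for one role only (flow_user is a hypothetical name):
ALTER ROLE flow_user SET work_mem = '64MB';
```

Keep in mind that work_mem applies per sort or hash operation, and per worker, so a single query can use several multiples of it.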

The EXPLAIN ANALYZE output of the working query looks like this:

[Image: EXPLAIN ANALYZE plan of the working query]
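A plan like this can be produced by prefixing the query with EXPLAIN. A minimal sketch, assuming the same three tables; note that the ANALYZE option actually executes the query, so it takes as long as a normal run.

```sql
-- ANALYZE runs the query and reports real timings; BUFFERS adds I/O counters.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM company
LEFT OUTER JOIN company_s
  ON company.domain = company_s.domain
LEFT OUTER JOIN socials
  ON company_s.raw_domain = socials.domain;
```

In the text output, "Workers Launched" shows whether parallel workers were used, and the "Batches" and "Memory Usage" figures on Hash nodes show whether the hash joins fit within work_mem.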

Upvotes: 1
