gerrit

Reputation: 421

Postgres table size way bigger than sum of rows

I want to collect large amounts of data and thus need to be very storage efficient. Right now, while testing, I'm seeing that my database table is much bigger than the sum of its rows suggests, and so far I have been unable to figure out why.

The table has the following columns:

  1. bigint, primary key
  2. numeric, precision 10, scale 7
  3. numeric, precision 10, scale 7
  4. timestamp without time zone
  5. timestamp without time zone
  6. bigint, indexed
  7. character varying
  8. json
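
For reference, a minimal DDL sketch matching that description could look like the following; all column names are placeholders I invented, only the types, the primary key, and the secondary index come from the list above.

CREATE TABLE table_name (
    id         bigint PRIMARY KEY,            -- 1. bigint, primary key
    num_a      numeric(10,7),                 -- 2. numeric, precision 10, scale 7
    num_b      numeric(10,7),                 -- 3. numeric, precision 10, scale 7
    ts_a       timestamp without time zone,   -- 4.
    ts_b       timestamp without time zone,   -- 5.
    ref_id     bigint,                        -- 6. bigint, indexed below
    label      character varying,             -- 7.
    payload    json                           -- 8. flexible, schema-less data
);

CREATE INDEX ON table_name (ref_id);          -- the single secondary index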

I ran the following queries to collect statistical information:

SELECT pg_size_pretty(pg_relation_size('table_name')); -- 132 MB
SELECT pg_size_pretty(pg_total_relation_size('table_name')); -- 162 MB
SELECT COUNT(*) FROM table_name; -- 91706
SELECT pg_relation_size('table_name') / (SELECT COUNT(*) FROM table_name); -- 1503
SELECT octet_length(table_name.*::text) FROM table_name ORDER BY 1 DESC LIMIT 1; -- 312
SELECT AVG(octet_length(table_name.*::text)) FROM table_name; -- 279.05
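
Note that octet_length(row::text) measures the text representation, not what is actually stored. As a rough cross-check, pg_column_size reports the stored (possibly compressed) size of a value; applied to a whole row it is only an approximation of the heap tuple size, and the payload column name below is a placeholder for the json column:

SELECT AVG(pg_column_size(table_name.*)) AS avg_stored_row_bytes,
       AVG(pg_column_size(payload))      AS avg_json_bytes      -- placeholder column name
FROM table_name;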

I only have ~91700 rows in the database right now - average row size 279 bytes - and they already take up 132 MB (table data only) instead of the 91700 * 279 bytes ≈ 25 MB I would expect. Rows are supposed to go into the hundreds of millions on the production system, and data taking up five times more space than expected is a problem I would rather solve now than when I'm about to run into actual limits.

Most of the data in a row is flexible and does not follow a fixed schema; it is stored in a json column. Since json takes up less space than jsonb and the json column will rarely, if ever, be queried, this is acceptable. The missing schema for the bulk of each row is what initially led me to start with MongoDB. When testing with MongoDB, I had ~1.7 million documents that took up about 660 MB of space including the index. I thought I could do better in Postgres, but if my Postgres table grows linearly, I will already reach break-even at ~370000 rows.
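
If one wanted to double-check the json-versus-jsonb size assumption against the actual data, a quick comparison could look like this (payload is again a placeholder for the json column name; the cast value is measured uncompressed, so this is only a rough comparison):

SELECT AVG(pg_column_size(payload))        AS avg_json_bytes,
       AVG(pg_column_size(payload::jsonb)) AS avg_jsonb_bytes
FROM table_name;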

The database runs on a shared ElephantSQL instance and the statistics say there are 86MB of temp files and the total database size is 172MB.

I have two questions:

  1. The difference between the table size and the total relation size is 30 MB. What could this be other than the index? (I assume the temp files are not involved here.) There is only one index on this table, which is on a bigint column. (See the size-breakdown query after this list.)
  2. How come the actual average row size (1503 B) is more than five times the average row size (279.05 B) that octet_length returns?
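
Regarding question 1, the standard size functions can break the total down; pg_table_size includes the TOAST table, free space map, and visibility map on top of the main heap, while pg_indexes_size covers all indexes on the table:

SELECT pg_size_pretty(pg_relation_size('table_name'))       AS main_heap,
       pg_size_pretty(pg_table_size('table_name'))          AS heap_plus_toast_fsm_vm,
       pg_size_pretty(pg_indexes_size('table_name'))        AS indexes,
       pg_size_pretty(pg_total_relation_size('table_name')) AS total;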

Additional information and example:

I set up a dedicated ElephantSQL instance and started collecting data, but here the proportions are even worse:

SELECT pg_size_pretty(pg_relation_size('table_name')); -- 16 MB
SELECT pg_size_pretty(pg_total_relation_size('table_name')); -- 21 MB
SELECT COUNT(*) FROM table_name; -- 20672
SELECT pg_relation_size('table_name') / (SELECT COUNT(*) FROM table_name); -- 819
SELECT AVG(octet_length(table_name.*::text)) FROM table_name; -- 113.53

The amount of non-relational data coming in was actually lower here, hence the smaller rows. In this case the actual average row size is more than seven times the average that octet_length returns. This leads me to assume that the per-row overhead - wherever it originates - is massive, and that the less actual json data a row contains, the worse the ratio between these two metrics gets.
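
If the instance allows installing the pgstattuple extension (I have not checked whether ElephantSQL permits this), it breaks the table down into live tuples, dead tuples, and free space directly:

CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstattuple('table_name');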

Upvotes: 1

Views: 2141

Answers (1)

jjanes

Reputation: 44192

PostgreSQL has a large per-row overhead, but not nearly this large (unless ElephantSQL has done something to explode it further). Your table must be extremely bloated. You can confirm that by running VACUUM FULL table_name and seeing if that shrinks it.
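
A sketch of that confirmation step; note that VACUUM FULL rewrites the whole table, takes an ACCESS EXCLUSIVE lock, and needs enough free disk space for the new copy:

VACUUM (FULL, VERBOSE) table_name;
SELECT pg_size_pretty(pg_total_relation_size('table_name'));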

If it is bloated, it would be due to something about the way you use it (e.g. the pattern of INSERT, UPDATE, DELETE on it) or the way you vacuum it. But you haven't described any of that stuff.
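
One way to see whether (auto)vacuum has been keeping up and how many dead rows are currently sitting in the table is the standard statistics view:

SELECT n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'table_name';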

Upvotes: 5
