Reputation: 1055
Confused by the term 'table size' in Redshift.
We have : svv_table_info.size "Size of table in 1MB blocks"
svv_table_info.pct_used "Percent of available space used"
... so I assume that a lot of the 'size' is empty space due to sort keys etc
Then we have this.. https://aws.amazon.com/premiumsupport/knowledge-center/redshift-cluster-storage-space/
.. which uses the term 'minimum' table size. But nowhere can I find an explanation of what they means in the real world ? Is this a theoretical minimum if optimally configured ?
Ultimately I need to find out the basic size of original tangible data without any overheads.
Then yes, how much disc space is it actually costing to store it in Redshift.
So if I took 1TB out of our on-prem database and shoved it into Redshift, I'd be looking to see something like 1TB (data) & 1.2TB (data + Redshift overheads).
Hope someone can help clarify 🤞
Upvotes: 1
Views: 2015
Reputation: 11032
Redshift stores data in 1MB blocks and blocks are associated with a slice and a column. So if I have 2 slices in my cluster and a table with 4 columns (plus the 3 system columns to make 7) distributed as EVEN containing at least 2 rows, then my table will minimally take up 2 X 7 X 1MB of space (14MB on disk). This is all that article is saying.
Now if I insert 2 additional rows into this table, Redshift will makes new blocks for this data. So now my 4 rows of data take up 28MB of space. However, if I Vacuum the table the wasted space will be reclaimed and the table size will come back down to 14MB. (yes this is a bit of an oversimplification but trying to get the concepts across)
As a rule of thumb a single 1MB block will typically hold between 100,000 rows and 2,000,000 rows of compressed data. (yes this depends on the data not being monster varchars) So for our table above I can keep adding rows (and vacuuming) without increasing the table size on disk until I get a few hundred thousand rows (per slice) in the table. Redshift is very efficient at storing large chunks of data but very inefficient at storing small ones.
What Redshift knows about your data size is how many blocks it takes on disk (across all the nodes, slices, and columns). How big your data would be if it was stored differently (not in blocks, compressed or uncompressed) is not data that is tracked. As John noted, for big tables, Redshift stores data more efficiently than most other database (when compression is used).
Upvotes: 2
Reputation: 269284
You cannot translate from an existing database size to the size of a table in Redshift. This is because:
Given compression, your data is likely to be smaller in Redshift than an original (uncompressed) data source. However, you can't really calculate that in advance unless you have transferred similar data in the past and apply a similar ratio.
Upvotes: 0