Reputation: 741
I want to know how Amazon Redshift stores data for a column. Let me make my question clear with an example.
Suppose we have a table that includes a column of type Character Varying (4 bytes + 1 byte per character for ASCII values), and the table has 1000 records, but 500 of them have no value for this column (null). Now the question is: what would be the size of the storage required to store this column? Is it (4000 + sum of the string lengths) bytes? Is it less than that? How does Redshift store the null values in this scenario?
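To make the arithmetic concrete, here is the naive, uncompressed upper bound I have in mind (the 10,000-byte total string length is a made-up number):

```
-- 1000 records * 4 bytes of per-value overhead,
-- plus the bytes of the 500 non-null strings (hypothetically 10,000 in total)
SELECT 1000 * 4 + 10000 AS naive_upper_bound_bytes;
```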
Thank you.
Upvotes: 1
Views: 1075
Reputation: 11
I see that Redshift by default assigns LZO compression to columns that are not given an explicit compression encoding.

LZO encoding provides a very high compression ratio with good performance. LZO encoding works especially well for CHAR and VARCHAR columns that store very long character strings, especially free-form text, such as product descriptions, user comments, or JSON strings. LZO is the default encoding except for columns that are designated as sort keys and columns that are defined as BOOLEAN, REAL, or DOUBLE PRECISION data types.
Source: docs.aws.amazon.com
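If you don't want to rely on the default, the encoding can also be declared per column in the DDL. A minimal sketch (table and column names are made up):

```
CREATE TABLE product_reviews (
    review_id   INTEGER       ENCODE raw,   -- e.g. leave a key column uncompressed
    review_text VARCHAR(4000) ENCODE lzo    -- long free-form text suits LZO
);
```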
Upvotes: 1
Reputation: 8288
I would say it depends.

First, compression matters a lot; proper compression can dramatically reduce the storage space needed.

Second, Redshift storage is managed at the block (1 MB) level, per column and per slice, so even if a column holds just ten bytes on a slice, it still occupies a full 1 MB block.

Third, vacuuming is relevant: Redshift won't release the storage space of deleted records until a VACUUM is run.
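Because allocation happens in 1 MB blocks per column per slice, you can see the effect directly in STV_BLOCKLIST. A sketch (substitute your own table id, e.g. from STV_TBL_PERM):

```
-- Blocks (1 MB each) used by each column of one table, summed across slices.
-- Column numbers follow attnum order; the last few are hidden system columns.
SELECT col, COUNT(*) AS mb_used
FROM stv_blocklist
WHERE tbl = 123456          -- hypothetical table id
GROUP BY col
ORDER BY col;
```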
If you want to test the storage space used under different table schemas or compression types, you can use the following query:
SELECT trim(pgn.nspname) as Schema
,trim(a.name) as Table
,CASE WHEN LENGTH (TRIM (a.name)) > 30 THEN 'Y'
ELSE 'N'
END AS TableNameViolation
,id as TableId
,det.distkey
,det.NumberOfSortKeys
,det.FirstSortCol
,b.mbytes
,b.mbytes / 1024 AS GBytes
,det.HasEncoding
,a.rows
,a.unsorted_rows
FROM (SELECT db_id
,id
,name
,sum (rows) as rows
,sum (rows) - sum(sorted_rows) as unsorted_rows
FROM stv_tbl_perm a
GROUP BY db_id
,id
,name
) as a
INNER JOIN pg_class as pgc
ON pgc.oid = a.id
INNER JOIN pg_namespace as pgn
ON pgn.oid = pgc.relnamespace
LEFT OUTER JOIN (SELECT tbl
,count(*) as mbytes
FROM stv_blocklist
GROUP BY tbl
) b
ON a.id = b.tbl
INNER JOIN ( SELECT attrelid
,MIN (case attisdistkey when 't' then attname else null end) as "distkey"
,MIN (case attsortkeyord when 1 then attname else null end ) AS FirstSortCol
,MAX (attsortkeyord) as NumberOfSortKeys
,MAX (CASE WHEN attencodingtype > 0 THEN 'Y' ELSE 'N' END) as HasEncoding
FROM pg_attribute
WHERE attnum >= 1
GROUP BY attrelid
) as det
ON det.attrelid = a.id
WHERE mbytes IS NOT NULL
ORDER BY pgn.nspname
,mbytes DESC
;
Upvotes: 4
Reputation: 170
If you don't define an encoding for the column, it will be stored in RAW format (without any encoding) and no compression will be done. In that scenario Redshift has to allocate the default space for every record, whether or not it is null.

The actual space taken varies a lot based on the encoding you choose and also on the type of data you have.
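One way to see what Redshift itself would choose, rather than leaving columns as RAW, is ANALYZE COMPRESSION, which samples the table and reports a suggested encoding per column (table name is made up):

```
-- Reports, per column, the recommended encoding and the estimated
-- percentage reduction versus the current encoding.
ANALYZE COMPRESSION my_table;
```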
Upvotes: 1