Majid Darabi

Reputation: 741

Redshift column storage method

I want to know how Amazon Redshift stores data for a column. Let me make my question clear with an example.

Suppose we have a table that includes a column of type Character Varying (4 bytes plus 1 byte per character for ASCII values), and the table holds 1000 records, but 500 of those records have no value for that column (null). Now the question is: what would be the size of the storage required for this column? Is it (4000 + the sum of the string lengths) bytes? Is it less than that? How does Redshift store null values in this scenario?
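For concreteness, here is a minimal sketch of the scenario I have in mind (the table and column names are made up):

-- Hypothetical table: 1000 rows, 500 of which have NULL in the VARCHAR column
CREATE TABLE my_test_table (
    id          INTEGER,
    description VARCHAR(100)  -- nominally 4 bytes + 1 byte per ASCII character
);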

Thank you.

Upvotes: 1

Views: 1075

Answers (3)

Viswanath Dandigunta

Reputation: 11

By default, Redshift assigns LZO compression to columns that are not given an explicit compression encoding.

LZO encoding provides a very high compression ratio with good performance. LZO encoding works especially well for CHAR and VARCHAR columns that store very long character strings, especially free form text, such as product descriptions, user comments, or JSON strings. LZO is the default encoding except for columns that are designated as sort keys and columns that are defined as BOOLEAN, REAL, or DOUBLE PRECISION data types.

Source: docs.aws.amazon.com
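To check which encoding a column actually ended up with, you can query the PG_TABLE_DEF system view. The table name below is just a placeholder, and note that PG_TABLE_DEF only lists tables in schemas on your search_path:

SELECT "column", type, encoding
FROM pg_table_def
WHERE tablename = 'my_test_table';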

Upvotes: 1

ciphor

Reputation: 8288

I would say it depends.

First, compression matters a lot; a suitable compression encoding can dramatically reduce the storage space needed.

Second, Redshift storage is managed at the block (1 MB) level, per column and per slice, so even if a column holds only ten bytes, it still occupies at least one full block.

Third, vacuuming matters: Redshift won't release the storage space used by deleted records until a VACUUM is run.
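As a concrete illustration of the second point, you can count the 1 MB blocks held by each column of a table from STV_BLOCKLIST (the table name is a placeholder; as far as I know, the highest column positions belong to hidden system columns):

SELECT col, COUNT(*) AS blocks_1mb   -- col is the zero-based column position
FROM stv_blocklist
WHERE tbl IN (SELECT id FROM stv_tbl_perm WHERE TRIM(name) = 'my_test_table')
GROUP BY col
ORDER BY col;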

If you want to compare the storage used by different table schemas or compression types, you can use the following query:

SELECT trim(pgn.nspname) as Schema
      ,trim(a.name) as Table
      ,CASE WHEN LENGTH (TRIM (a.name)) > 30 THEN 'Y'
            ELSE 'N'
       END AS TableNameViolation    
      ,id as TableId
      ,det.distkey
      ,det.NumberOfSortKeys
      ,det.FirstSortCol
      ,b.mbytes
      ,b.mbytes / 1024 AS GBytes
      ,det.HasEncoding
      ,a.rows
      ,a.unsorted_rows
FROM (SELECT db_id
            ,id
            ,name
            ,sum (rows) as rows
            ,sum (rows) - sum(sorted_rows) as unsorted_rows
      FROM stv_tbl_perm a
      GROUP BY db_id
              ,id
              ,name
     ) as a
INNER JOIN pg_class as pgc
      ON pgc.oid = a.id
INNER JOIN pg_namespace as pgn
      ON pgn.oid = pgc.relnamespace
LEFT OUTER JOIN (SELECT tbl
                       ,count(*) as mbytes
                 FROM stv_blocklist
                 GROUP BY tbl
                ) b
     ON a.id = b.tbl
INNER JOIN ( SELECT attrelid
                   ,MIN (case attisdistkey when 't' then attname else null end)  as "distkey"
                   ,MIN (case attsortkeyord when 1 then attname  else null end ) AS FirstSortCol
                   ,MAX (attsortkeyord) as NumberOfSortKeys
                   ,MAX (CASE WHEN attencodingtype > 0 THEN 'Y' ELSE 'N' END) as HasEncoding  
             FROM pg_attribute
             WHERE attnum >= 1
             GROUP BY attrelid
           ) as det
      ON det.attrelid = a.id
WHERE mbytes IS NOT NULL
ORDER BY pgn.nspname
        ,mbytes DESC
;

Upvotes: 4

Rakesh Singh

Reputation: 170

If you don't define an encoding for the column, it will be stored in RAW format (without any encoding) and no compression will be applied. In that scenario it has to allocate the default space for every record, whether or not the value is null.

The actual space taken varies a lot based on the encoding that you choose and also the type of data you have.
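A rough way to see how much the encoding choice matters for your actual data is ANALYZE COMPRESSION, which reports a suggested encoding and an estimated size reduction for each column (the table name is a placeholder):

ANALYZE COMPRESSION my_test_table;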

Upvotes: 1
