neo

Reputation: 1

Amazon Redshift table block allocation

Our cluster is a 4-node cluster. We have a table consisting of 72 columns. When we query the SVV_DISKUSAGE system view to check the allocation of columns on each slice, we observed that every column has been allocated two blocks (0 and 1). However, a few columns have a datatype of varchar(1), which should not occupy two blocks of space.

Is it possible that if one of the columns occupies more than one block (for example, a varchar(1500) column), the same allocation is applied to all the other columns of the table? If so, how does this affect the overall database size of the cluster?

Upvotes: 0

Views: 1138

Answers (1)

John Rotenstein

Reputation: 269736

Each Amazon Redshift storage block is 1MB in size. Each block contains data from only one column within one table.

The SVV_DISKUSAGE system view contains a list of these blocks, e.g.:

select db_id, trim(name) as tablename, col, tbl, max(blocknum)
from svv_diskusage
where name='salesnew'
group by db_id, name, col, tbl
order by db_id, name, col, tbl;

db_id  | tablename  | col |  tbl   | max
--------+------------+-----+--------+-----
175857 | salesnew   |   0 | 187605 | 154
175857 | salesnew   |   1 | 187605 | 154
175857 | salesnew   |   2 | 187605 | 154
175857 | salesnew   |   3 | 187605 | 154
175857 | salesnew   |   4 | 187605 | 154
175857 | salesnew   |   5 | 187605 |  79
175857 | salesnew   |   6 | 187605 |  79
175857 | salesnew   |   7 | 187605 | 302
175857 | salesnew   |   8 | 187605 | 302
175857 | salesnew   |   9 | 187605 | 302
175857 | salesnew   |  10 | 187605 |   3
175857 | salesnew   |  11 | 187605 |   2
175857 | salesnew   |  12 | 187605 | 296
(13 rows)

The number of blocks required to store each column depends upon the amount of data and the compression encoding used for that table.
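You can see which compression encoding was applied to each column by querying the PG_TABLE_DEF catalog table (using the 'salesnew' table from the example above; substitute your own table name):

```sql
-- Show the datatype and compression encoding for each column of the table.
-- An encoding of 'none' (raw) generally consumes more blocks than an
-- encoded column holding the same data.
select "column", type, encoding
from pg_table_def
where tablename = 'salesnew';
```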

Amazon Redshift also stores the minvalue and maxvalue of the data that is stored in each block. These are visible in the SVV_DISKUSAGE view. These values are often called Zone Maps, and they are used to identify blocks that can be skipped when scanning data. For example, if a WHERE clause looks for rows with a value of 5 in a column, then blocks with a minvalue of 6 for that column can be skipped entirely. This is especially useful when data is compressed.

To investigate why your data is consuming two blocks, examine:

  • The minvalue and maxvalue of each block
  • The number of values (num_values) stored in each block

Those values will give you an idea of how much data is stored in each block, and whether that matches your expectations.
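A query along these lines (again using the example 'salesnew' table, and column 0; adjust both to match your table) will show those per-block details:

```sql
-- Inspect the zone maps (minvalue/maxvalue) and row count (num_values)
-- for each block of a single column. A block with few rows, or one
-- allocated but holding little data, stands out here.
select slice, blocknum, minvalue, maxvalue, num_values
from svv_diskusage
where name = 'salesnew'
  and col = 0
order by slice, blocknum;
```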

Also, take a look at the distribution style used on the table. If the distribution style is set to ALL, then the table's data is replicated to every node. This could also explain your block count.
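The distribution style is available in the SVV_TABLE_INFO view (shown here for the example 'salesnew' table):

```sql
-- Show the distribution style, total size (in 1 MB blocks) and row count.
-- A diststyle of 'ALL' means every node holds a full copy of the table.
select "table", diststyle, size, tbl_rows
from svv_table_info
where "table" = 'salesnew';
```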

Finally, if data has been deleted from the table, then the old values might still be consuming disk space. Run a VACUUM command on the table to remove deleted data.
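For example (against the 'salesnew' table used above):

```sql
-- Reclaim space from deleted rows and re-sort the table
vacuum full salesnew;

-- Then refresh the planner statistics for the table
analyze salesnew;
```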

A good reference is: Why does a table in my Amazon Redshift cluster consume more disk storage space than expected?

Upvotes: 1
