Reputation: 536
In the AWS doc it is written that:
When you turn on concurrency scaling, Amazon Redshift automatically adds additional cluster capacity to process an increase in both read and write queries. Users see the most current data, whether the queries run on the main cluster or a concurrency-scaling cluster.
This is really vague to me. How can this new cluster be created?
"Users see the most current data"
If the data is spread across multiple nodes on different EBS disks, how can the new cluster be created with the most up-to-date data? Is this feature based on EBS snapshots of the nodes?
Upvotes: 1
Views: 1369
Reputation: 11032
Why do you think that Redshift storage is based on EBS? That would be a networked storage solution and would not provide the speed and bandwidth needed for a big-data system like Redshift. The node storage is system-internal AFAIK.
To understand how concurrent clusters work, let's look at a simplistic approximation. The basis for Redshift coherency (the property that allows users to see the most up-to-date information) is the block. Blocks are distributed around the base cluster's nodes such that data being used by any node may be remote (across the network from that node). So even within the base cluster data can be remote, and the Redshift coherency system ensures that the correct version of any block is served up in all cases. This system follows an MVCC model (multi-version concurrency control), which works well for databases distributed across a networked cluster like Redshift.
The concurrent cluster can be seen as just a bunch more nodes that tap into the base cluster's coherency system. In this case all of the data blocks are remote to the node doing the work, not just some of them. It is the coherency system that ensures that the right block versions are served to any requesting node, base or remote.
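The MVCC idea above can be illustrated with a toy sketch (this is not Redshift's implementation; the class and method names here are invented for illustration). Each write creates a new version of a block, and a reader sees the newest version committed at or before its snapshot, so every reader gets a consistent view regardless of which node serves the block:

```python
# Toy MVCC-style block store (illustrative only, not Redshift internals).
# Writes append new block versions; reads resolve against a snapshot.

class BlockStore:
    def __init__(self):
        self.versions = {}   # block_id -> list of (commit_ts, data)
        self.clock = 0       # logical commit timestamp

    def write(self, block_id, data):
        self.clock += 1
        self.versions.setdefault(block_id, []).append((self.clock, data))
        return self.clock

    def snapshot(self):
        # A reader's consistent view of the store.
        return self.clock

    def read(self, block_id, snapshot_ts):
        # Newest version committed at or before the reader's snapshot.
        candidates = [(ts, d) for ts, d in self.versions.get(block_id, [])
                      if ts <= snapshot_ts]
        return max(candidates)[1] if candidates else None


store = BlockStore()
store.write("b1", "v1")
snap = store.snapshot()      # a reader starts here
store.write("b1", "v2")      # a later write creates a new version
print(store.read("b1", snap))              # -> v1 (reader's snapshot)
print(store.read("b1", store.snapshot()))  # -> v2 (new reader)
```

The key point is that an old reader and a new reader can both be served correct data at the same time, which is what lets remote (or concurrency-scaling) nodes query safely while the base cluster keeps writing.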
Now the actual implementation of concurrent clusters is more complex than this, to ensure performant execution. Nodes in the concurrent cluster are paired with the "same" node in the base cluster and can cache blocks for that cluster to use. But requests for these cached blocks are always coherency-checked against the base cluster, because it is the source of truth. If nothing has changed, the cached version of the data can be used. In this way the concurrent cluster can have all the data it needs and can execute read-only queries fairly independently.

There is little additional load on the base cluster once blocks are cached on the concurrent cluster, EXCEPT for coherency checking, which is done on the base cluster. If the bulk of the data in your database is static and most of your query load is read-only, then very high levels of concurrency scaling can be achieved. However, if your data is changing rapidly, there will be a lot of additional coherency checking and copying of new block versions. Since these actions impact the base cluster, the amount of concurrency scaling should be limited (1-3 clusters) in these cases.
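The cache-plus-coherency-check behavior can be sketched as follows (again a toy model with invented names, not Redshift's actual protocol): the scaling node always performs a cheap version check against the base cluster, but only re-fetches a block when its cached copy is stale.

```python
# Toy sketch of a concurrency-scaling node caching blocks, with every
# read coherency-checked against the base cluster (illustrative only).

class BaseCluster:
    def __init__(self):
        self.blocks = {}     # block_id -> (version, data)

    def write(self, block_id, data):
        ver = self.blocks.get(block_id, (0, None))[0] + 1
        self.blocks[block_id] = (ver, data)

    def current_version(self, block_id):
        # Cheap coherency check; this load stays on the base cluster.
        return self.blocks[block_id][0]

    def fetch(self, block_id):
        # Expensive: copies the block over the network.
        return self.blocks[block_id]


class ScalingNode:
    def __init__(self, base):
        self.base = base
        self.cache = {}      # block_id -> (version, data)
        self.fetches = 0

    def read(self, block_id):
        ver = self.base.current_version(block_id)   # always check coherency
        cached = self.cache.get(block_id)
        if cached is None or cached[0] != ver:      # cache miss or stale
            self.fetches += 1
            self.cache[block_id] = self.base.fetch(block_id)
        return self.cache[block_id][1]


base = BaseCluster()
base.write("b1", "v1")
node = ScalingNode(base)
node.read("b1")
node.read("b1")         # second read served from cache
print(node.fetches)     # -> 1
base.write("b1", "v2")  # base changes, cached copy is now stale
print(node.read("b1"))  # -> v2, re-fetched after the version check failed
print(node.fetches)     # -> 2
```

This mirrors the trade-off described above: static data means mostly cheap version checks and cache hits, while rapidly changing data forces repeated re-fetches that load the base cluster.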
Update: references and materials for further exploration of Redshift storage, blocks and MVCC have been requested.
Redshift Architecture overview: https://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html
Redshift data organization overview: https://docs.aws.amazon.com/redshift/latest/dg/t_Distributing_data.html
Redshift deep dive by AWS (slide 28 on storage): https://www.slideshare.net/AmazonWebServices/deep-dive-on-amazon-redshift-72473281
MVCC overview: https://en.wikipedia.org/wiki/Multiversion_concurrency_control
Postgres 8 MVCC (Redshift was forked from Postgres 8): https://www.postgresql.org/docs/8.1/mvcc.html
My cut at this as part of re:Invent 2016: https://www.youtube.com/watch?v=bxfnWTiY7EM
Upvotes: 3