Reputation: 20023
Let's say I have 4 identical servers, each with 300GB of disk space, and a replication factor of 2 (so effectively two 300GB nodes, each replicated on another physical machine with 300GB of space). How does space allocation work across these nodes?
For instance, imagine the 300GB on nodes 1 and 2 (node 2 being the replica of node 1) is completely used up by Cassandra and another application that also uses disk space, while the second pair (nodes 3 and 4) still has free disk space because those machines run only Cassandra. Would Cassandra store new entries on nodes 3 and 4 instead, given that the first 2 nodes are out of disk space, or would it blow up?
Broadening the situation across multiple servers in a rack, would Cassandra intelligently manage disk space requirements and put the data on nodes with more free storage space? Similarly, would it be able to work with servers of varying storage capacity (some 600GB, some 300GB, etc.)?
Many thanks,
Upvotes: 2
Views: 1578
Reputation: 16576
Cassandra does not allocate data by available space. It places each row on nodes according to the hash of that row's partition key. Because of this, there is no intelligent live balancing of where data should go.
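To make the point concrete, here is a minimal toy sketch of hash-based placement. It is not Cassandra's actual code: the ring size, the evenly spaced tokens, the node names, and the use of MD5 as the hash are all assumptions chosen for illustration, and the "walk clockwise for the next replica" rule is a simplified stand-in for a replication strategy. The thing to notice is that free disk space never appears as an input.

```python
import bisect
import hashlib

RING_SIZE = 2**32  # toy ring size (assumption); real token ranges are far larger

# Hypothetical 4-node cluster with evenly spaced tokens: token -> node name.
TOKENS = {
    0 * RING_SIZE // 4: "node1",
    1 * RING_SIZE // 4: "node2",
    2 * RING_SIZE // 4: "node3",
    3 * RING_SIZE // 4: "node4",
}
SORTED_TOKENS = sorted(TOKENS)


def token_for(partition_key):
    """Hash a partition key onto the toy ring (MD5 is a stand-in hash)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % RING_SIZE


def replicas_for(partition_key, replication_factor=2):
    """Return the owning node plus the next nodes clockwise on the ring."""
    token = token_for(partition_key)
    # First node whose token is >= the key's token, wrapping past the end.
    start = bisect.bisect_left(SORTED_TOKENS, token) % len(SORTED_TOKENS)
    return [
        TOKENS[SORTED_TOKENS[(start + i) % len(SORTED_TOKENS)]]
        for i in range(replication_factor)
    ]


if __name__ == "__main__":
    for key in ("user:1", "user:2", "user:3"):
        print(key, "->", replicas_for(key))
```

The same key always maps to the same replicas, no matter how full those nodes are, which is why a full disk on one node is not worked around automatically.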
To do approximate balancing, you can change the size of the token range each node is responsible for (without vnodes) or adjust the number of vnodes per node. This all needs to be done manually.
Changes in cassandra.yaml:
Example Vnodes:
Node 1: num_tokens: 128
Node 2: num_tokens: 128
Node 3: num_tokens: 256
Node 4: num_tokens: 256
Example Non-Vnodes (given a full range = 100):
Node 1: initial_token: 15
Node 2: initial_token: 30
Node 3: initial_token: 65
Node 4: initial_token: 100
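As a rough check of what the two examples above buy you, here is a small sketch that computes the approximate share of the ring each node ends up owning. The function names are made up for this illustration, and it assumes each node owns the range from the previous node's token (exclusive) up to its own token (inclusive). For vnodes the result is only an expectation, since the individual vnode tokens are assigned randomly.

```python
from fractions import Fraction


def ownership_from_num_tokens(num_tokens):
    """Expected ring share per node when shares scale with vnode count."""
    total = sum(num_tokens.values())
    return {node: Fraction(n, total) for node, n in num_tokens.items()}


def ownership_from_initial_tokens(initial_tokens, ring_size):
    """Ring share per node when each node owns (previous token, own token]."""
    ordered = sorted(initial_tokens.items(), key=lambda item: item[1])
    shares = {}
    for i, (node, token) in enumerate(ordered):
        prev_token = ordered[i - 1][1]  # i == 0 wraps around to the last node
        # A single-node ring has width 0 after the modulo, so treat it as full.
        width = (token - prev_token) % ring_size or ring_size
        shares[node] = Fraction(width, ring_size)
    return shares


if __name__ == "__main__":
    vnodes = {"Node 1": 128, "Node 2": 128, "Node 3": 256, "Node 4": 256}
    tokens = {"Node 1": 15, "Node 2": 30, "Node 3": 65, "Node 4": 100}

    for node, share in ownership_from_num_tokens(vnodes).items():
        print(f"vnodes     {node}: {float(share):.1%}")
    for node, share in ownership_from_initial_tokens(tokens, 100).items():
        print(f"no vnodes  {node}: {float(share):.1%}")
```

In both layouts nodes 3 and 4 take on roughly twice the data of nodes 1 and 2, so you would pick the ratios to match the disk each machine can actually spare.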
Upvotes: 3