Reputation: 1672
I'm inserting into Cassandra 3.12 via the Python (DataStax) driver using CQL BatchStatements [1]. With a primary key that results in a small number of partitions (10-20) everything works well, but the data is not uniformly distributed across nodes.
If I include a high-cardinality column, for example time or client IP, in addition to date, the batch inserts result in a Partition Too Large error, even though the number of rows and the row length are the same.
Higher cardinality keys should result in more but smaller partitions. How does a key generating more partitions result in this error?
[1] Although everything I have read suggests that batch inserts can be an anti-pattern, with a batch covering only one partition I still see the highest throughput for this case compared to async or concurrent inserts (a minimal sketch of how I batch follows the schema below).
CREATE TABLE test
(
    date date,
    time time,
    cid text,
    loc text,
    src text,
    dst text,
    size bigint,
    s_bytes bigint,
    d_bytes bigint,
    time_ms bigint,
    log text,
    PRIMARY KEY ((date, loc, cid), src, time, log)
)
WITH compression = {'class': 'LZ4Compressor'}
AND compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy',
                  'compaction_window_unit': 'DAYS',
                  'compaction_window_size': '1'};
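For reference, this is roughly how I build one batch per partition - a minimal sketch, where the contact point, keyspace, and the rows_for_partition iterable are placeholders:

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, BatchType

cluster = Cluster(['127.0.0.1'])          # placeholder contact point
session = cluster.connect('my_keyspace')  # placeholder keyspace

insert = session.prepare(
    "INSERT INTO test (date, loc, cid, src, time, log, size) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)")

# One unlogged batch per partition: every row shares the same
# (date, loc, cid) values, so the batch targets a single partition.
batch = BatchStatement(batch_type=BatchType.UNLOGGED)
for row in rows_for_partition:  # placeholder iterable of bound-value tuples
    batch.add(insert, row)
session.execute(batch)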
Upvotes: 1
Views: 326
Reputation: 3266
I guess you meant Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: Batch too large errors?
This is because of the parameter batch_size_fail_threshold_in_kb, which defaults to 50kB of data in a single batch - a warning is also logged earlier, at a 5kB threshold, controlled by batch_size_warn_threshold_in_kb in cassandra.yaml (see http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/configuration/configCassandra_yaml.html).
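The relevant entries in cassandra.yaml look like this (defaults shown):

# cassandra.yaml - batch size limits
batch_size_warn_threshold_in_kb: 5     # log a warning above 5kB per batch
batch_size_fail_threshold_in_kb: 50    # reject the batch above 50kB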
Can you share your data model? Just adding a column doesn't mean the partition key changes - maybe you only changed the primary key by adding a clustering column. Hint: PRIMARY KEY (a,b,c,d) uses only a as the partition key, while PRIMARY KEY ((a,b),c,d) uses a,b as the partition key - an easily overlooked mistake.
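To make that concrete, here are the two forms side by side (hypothetical tables t1 and t2):

-- Partition key is a alone; b, c, d are clustering columns,
-- so all rows with the same a land in one partition.
CREATE TABLE t1 (a int, b int, c int, d int, v text,
                 PRIMARY KEY (a, b, c, d));

-- Composite partition key (a, b); c, d are clustering columns,
-- so rows are spread across many more, smaller partitions.
CREATE TABLE t2 (a int, b int, c int, d int, v text,
                 PRIMARY KEY ((a, b), c, d));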
Apart from that, the additional column takes some space - so you can easily hit the threshold now; just reduce the batch size so it fits within the limits again. In general it's a good approach to batch only upserts that affect a single partition, as you mentioned. Also make use of async queries and issue parallel requests to different coordinators to gain some more speed.
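A minimal sketch of the async approach with the Python driver - execute_concurrent_with_args handles the fan-out; the contact points, keyspace, and params list are placeholders:

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(['10.0.0.1', '10.0.0.2'])  # placeholder contact points
session = cluster.connect('my_keyspace')     # placeholder keyspace

insert = session.prepare(
    "INSERT INTO test (date, loc, cid, src, time, log, size) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)")

# Keep up to 100 inserts in flight at once; with token-aware routing
# (the default in recent driver versions) each request goes to a
# replica that owns the row's partition.
results = execute_concurrent_with_args(session, insert, params, concurrency=100)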
Upvotes: 3