Reputation: 2579
I'm using Cassandra 3.11.4 and Spark 2.3.3. When I query a large number of partition keys with joinWithCassandraTable (3 months of data with minute as the partition key, i.e. 3 * 30 * 24 * 60 ≈ 130,000 partition keys), I see many slow-query timeout entries in Cassandra's debug.log, such as:
<SELECT * FROM event_keyspace.event_table WHERE partitionkey1, partitionkey2 = value1, value2 AND column_key = column_value1 LIMIT 5000>, time 599 msec - slow timeout 500 msec
<SELECT * FROM event_keyspace.event_table WHERE partitionkey1, partitionkey2 = value5, value6 AND column_key = column_value5 LIMIT 5000>, time 591 msec - slow timeout 500 msec/cross-node
I'm using repartitionByCassandraReplica before joinWithCassandraTable.
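A minimal Scala sketch of the read path (keyspace, table and column names follow the log above; the key class, host, and key values are placeholders):

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder key class matching the two partition key columns from the log
case class EventKey(partitionkey1: String, partitionkey2: String)

val sparkConf = new SparkConf()
  .set("spark.cassandra.connection.host", "cassandra-host") // placeholder host
  .set("spark.cassandra.input.fetch.size_in_rows", "20000")
val sc = new SparkContext(sparkConf)

// One key per minute over the 3-month window (placeholder values shown)
val keysRdd = sc.parallelize(Seq(EventKey("value1", "value2"), EventKey("value5", "value6")))

val events = keysRdd
  .repartitionByCassandraReplica("event_keyspace", "event_table")
  .joinWithCassandraTable("event_keyspace", "event_table")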
I see that disk I/O goes to 100%. If I change the data model so that hour is the partition key instead of minute, the partitions become too large, which is not acceptable.
I suspect this LIMIT 5000 may be the cause, but even after setting input.fetch.size_in_rows the logged query did not change:
sparkConf.set("spark.cassandra.input.fetch.size_in_rows", "20000");
How can I change this LIMIT 5000 clause?
Upvotes: 2
Views: 950
Reputation: 43
Did you try reducing spark.cassandra.input.split.size? It sounds like all of the data is falling into the same partition.
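For example, something like the line below (a sketch; the exact property name depends on your connector version, and on the 2.x Spark Cassandra Connector it is spelled spark.cassandra.input.split.size_in_mb):

// Smaller splits mean more, smaller Spark tasks per Cassandra token range.
// Property name assumes Spark Cassandra Connector 2.x (split size in MB).
sparkConf.set("spark.cassandra.input.split.size_in_mb", "64")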
Upvotes: 0