Mustafa Genç

Reputation: 2579

Spark Cassandra Connector - Input Fetch Size

I'm using Cassandra 3.11.4 and Spark 2.3.3. When I query a large number of partition keys with joinWithCassandraTable (3 months of data where minute is the partition key, i.e. 3 * 30 * 24 * 60 partition keys), I see lots of slow-query entries in Cassandra's debug.log, like:

<SELECT * FROM event_keyspace.event_table WHERE partitionkey1, partitionkey2 = value1, value2 AND column_key = column_value1 LIMIT 5000>, time 599 msec - slow timeout 500 msec 

<SELECT * FROM event_keyspace.event_table WHERE partitionkey1, partitionkey2 = value5, value6 AND column_key = column_value5 LIMIT 5000>, time 591 msec - slow timeout 500 msec/cross-node

I'm using repartitionByCassandraReplica before joinWithCassandraTable.
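For reference, a minimal sketch of that pipeline, assuming the Scala API of the connector; the keyspace, table, and key column names are the ones from the logs above, and the EventKey case class and example values are only illustrative:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical key type whose fields match the table's partition key columns.
case class EventKey(partitionkey1: String, partitionkey2: String)

val conf = new SparkConf()
  .setAppName("event-join")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// One key per minute bucket over the 3-month range (only two example keys shown here).
val keys = sc.parallelize(Seq(EventKey("value1", "value2"), EventKey("value5", "value6")))

val joined = keys
  .repartitionByCassandraReplica("event_keyspace", "event_table")
  .joinWithCassandraTable("event_keyspace", "event_table")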

I see that disk I/O goes to 100%. If I change the data model so that hour is the partition key instead of minute, large partitions will be created, which is not acceptable.

I suspect that this LIMIT 5000 may be the cause, but even after I set input.fetch.size_in_rows, the log did not change:

sparkConf.set("spark.cassandra.input.fetch.size_in_rows", "20000");

How can I change this LIMIT 5000 clause?

Upvotes: 2

Views: 950

Answers (1)

ThinkTank0790

Reputation: 43

Did you try reducing spark.cassandra.input.split.size? It could help, because all the data is falling under the same partition.
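If you want to try this, a sketch of how it could be set is below; the exact property name varies with the connector version, and the value is only an example:

// A smaller split size produces more, smaller Spark partitions per scan.
// Property name assumed for connector 2.x; check your connector's documentation.
sparkConf.set("spark.cassandra.input.split.size_in_mb", "16")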

Upvotes: 0
