cgam
cgam

Reputation: 39

How many partitions spark will create running on aws emr and reading a table from cassandra?

I am reading a table from cassandra , my table size is 100 GB. I am running spark on aws EMR. I have below questions to ask ?

  1. How many partitions will be created in spark , reading a table from cassandra ? [ I am aware that spark cassandra connector has input split size property that by default is 64 MB,so number of spark partitions = data size/64 MB . But in my case there is huge difference number of partitions I see in spark ]
  2. If one of cassandra partition is too big , would there be any impact on number of partitions on spark ? [ my spark job is failing with memory issue , but I am under impression as data split by 64 MB , Is there still chance of data skewness ?]

Upvotes: 2

Views: 307

Answers (1)

Alex Ott
Alex Ott

Reputation: 87119

If Cassandra partitions are smaller than split size, then they will be combined into single Spark partition. But if Cassandra partition is bigger than split size, the Spark partition will be the size of the Cassandra partition - Cassandra connector won't split Cassandra partition into chunks.

This blog post from the one of the authors of Spark Cassandra connector talks in details how partitioning works with it.

Upvotes: 2

Related Questions