behas

Reputation: 3476

Writing large Spark dataframes to Cassandra - Performance Tuning

I am working on top of a Spark 2.1.0 / Cassandra 3.10 cluster (4 machines × 12 cores × 256 GB RAM × 2 SSDs) and have been struggling for quite a while with the performance of writing a specific large dataframe to Cassandra using spark-cassandra-connector 2.0.1.

Here is the schema of my table

CREATE TABLE sample_table (
        hash blob,
        field1 int,
        field2 int,
        field3 boolean,
        field4 bigint,
        field5 bigint,
        field6 list<FROZEN<some_type>>,
        field7 list<FROZEN<some_other_type>>,
        PRIMARY KEY (hash)
);

The hashes used as primary keys are 256 bits; the list fields contain up to 1 MB of data of some structured type. In total, I need to write several hundred million rows.

At the moment I am using the following write method:

import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.cassandra._                    // provides cassandraFormat
import com.datastax.driver.core.ConsistencyLevel
import com.datastax.spark.connector.writer.WriteConf

val sample_keyspace = "sample_keyspace"                    // keyspace name
val sample_table    = "sample_table"                       // table name

def storeDf(df: Dataset[Row]) = {
  df.write
    .cassandraFormat(sample_table, sample_keyspace)
    .options(WriteConf.ConsistencyLevelParam.option(ConsistencyLevel.ANY))
    .save()
}

...and Spark writes the dataframe using 48 parallel tasks, each writing approx. 95 MB in about 1.2 hours, which is of course not what I want.

I'd appreciate suggestions on how to tune write performance and/or possibly modify my schema in such a setting. Does repartitioning by hash and sorting within partitions make sense (see the sketch below)?
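
Concretely, the repartition-and-sort variant I have in mind would look roughly like the following sketch (the partition count of 96 is an arbitrary value for illustration, not a tuned one):

import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.functions.col

// Sketch: group rows by the Cassandra partition key so that rows destined for
// the same partition are handled by the same Spark task, then sort within each
// Spark partition. The partition count (96) is a placeholder to experiment with.
def storeDfRepartitioned(df: Dataset[Row]) = {
  df.repartition(96, col("hash"))
    .sortWithinPartitions(col("hash"))
    .write
    .cassandraFormat("sample_table", "sample_keyspace")
    .save()
}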

Thanks!

Upvotes: 1

Views: 1596

Answers (1)

Yogesh Mahajan

Reputation: 241

You can refer to this blog for Spark-Cassandra connector tuning; it will give you an idea of the performance numbers you can expect. You can also try another open-source product, SnappyData, a Spark-based database, which may give you very high performance for your use case.
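
For example, a rough sketch of setting the connector's write-tuning options directly on the writer (the option names come from the connector's reference configuration; the values are just starting points to experiment with, not recommendations for your cluster):

import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.cassandra._

// Rough sketch: pass spark.cassandra.output.* tuning options per write.
// The values are placeholders to experiment with, not tuned recommendations.
def storeDfTuned(df: Dataset[Row]) = {
  df.write
    .cassandraFormat("sample_table", "sample_keyspace")
    .option("spark.cassandra.output.concurrent.writes", "16")       // concurrent batches per task
    .option("spark.cassandra.output.batch.size.bytes", "4096")      // smaller batches given the ~1 MB list columns
    .option("spark.cassandra.output.throughput_mb_per_sec", "128")  // per-core write throttle
    .save()
}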

Upvotes: 1
