behas

Reputation: 3476

Writing large Spark dataframes to Cassandra - Performance Tuning

I am working on top of a Spark 2.1.0 / Cassandra 3.10 cluster (4 machines × 12 cores × 256 GB RAM × 2 SSDs) and have been struggling for quite a while with the performance of writing a specific large dataframe to Cassandra using spark-cassandra-connector 2.0.1.

Here is the schema of my table

CREATE TABLE sample_table (
        hash blob,
        field1 int,
        field2 int,
        field3 boolean,
        field4 bigint,
        field5 bigint,
        field6 list<FROZEN<some_type>>,
        field7 list<FROZEN<some_other_type>>,
        PRIMARY KEY (hash)
);

The hashes used as primary keys are 256 bits; the list fields contain up to 1 MB of data of some structured type. In total, I need to write several hundred million rows.

At the moment I am using the following write method:

import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.cassandra._                    // provides cassandraFormat
import com.datastax.driver.core.ConsistencyLevel
import com.datastax.spark.connector.writer.WriteConf

val sample_keyspace = "sample_keyspace"                    // keyspace name
val sample_table    = "sample_table"                       // table name

def storeDf(df: Dataset[Row]) = {
  df.write
    .cassandraFormat(sample_table, sample_keyspace)
    .options(WriteConf.ConsistencyLevelParam.option(ConsistencyLevel.ANY))
    .save()
}

...and Spark writes the dataframe using 48 parallel tasks, each writing approx. 95 MB in about 1.2 hours, which is of course not what I want.

I'd appreciate suggestions on how to tune write performance and/or possibly modify my schema in such a setting. Does repartitioning by hash and sorting within partitions make sense (see the sketch below)?
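
Concretely, the repartition-and-sort variant I have in mind would look roughly like the following sketch (the partition count of 96 is an arbitrary value for illustration, not a tuned one):

import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.functions.col

// Sketch: group rows by the Cassandra partition key so that rows destined for
// the same partition are handled by the same Spark task, then sort within each
// Spark partition. The partition count (96) is a placeholder to experiment with.
def storeDfRepartitioned(df: Dataset[Row]) = {
  df.repartition(96, col("hash"))
    .sortWithinPartitions(col("hash"))
    .write
    .cassandraFormat("sample_table", "sample_keyspace")
    .save()
}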

Thanks!

Upvotes: 1

Views: 1596

Answers (1)

Yogesh Mahajan

Reputation: 241

You can refer to this blog for Spark-Cassandra connector tuning; it will give you an idea of the performance numbers you can expect. You can also try another open-source product, SnappyData, a Spark-based database, which may give you very high performance for your use case.
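
For example, a rough sketch of setting the connector's write-tuning options directly on the writer (the option names come from the connector's reference configuration; the values are just starting points to experiment with, not recommendations for your cluster):

import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.cassandra._

// Rough sketch: pass spark.cassandra.output.* tuning options per write.
// The values are placeholders to experiment with, not tuned recommendations.
def storeDfTuned(df: Dataset[Row]) = {
  df.write
    .cassandraFormat("sample_table", "sample_keyspace")
    .option("spark.cassandra.output.concurrent.writes", "16")       // concurrent batches per task
    .option("spark.cassandra.output.batch.size.bytes", "4096")      // smaller batches given the ~1 MB list columns
    .option("spark.cassandra.output.throughput_mb_per_sec", "128")  // per-core write throttle
    .save()
}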

Upvotes: 1
