Israel Rodriguez

Reputation: 435

Why does Cassandra need so much available disk space for so little data?

I am a total beginner with Cassandra. I have researched a bit about how Cassandra works (https://www.scnsoft.com/blog/cassandra-performance) but I have run into a situation.

I have two CSVs totalling 384 MB and a Win10 virtual machine with almost 10 GB of free storage. My objective is to store the 384 MB of CSV data (7,496,735 rows) in a single table in Cassandra, using Spark/Scala from IntelliJ (everything on the same single-node virtual machine). I expected this to consume something like 200-400 MB of storage, but the reality was quite different: it consumed all 10 GB of disk before failing due to lack of disk space. I thought "this must be the replication factor", but it can't be, as the keyspace was created like this:

CREATE KEYSPACE IF NOT EXISTS testkeyspace WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 } AND DURABLE_WRITES = true ;

When counting the rows stored (it took forever, doing several operations on the console by itself), it turned out it had managed to save 1,767,450 rows.


The following day I realized that it had "freed" 6.38 GB of disk.

My questions are:

Why did Cassandra need so much available disk space for such a small amount of data (first 10 GB, and later 3.5 GB, for less than 0.5 GB of raw data)?

Why did it later free disk space (the 6.38 GB that was supposedly in use)?

And finally, how can I successfully store the CSV data in Cassandra from Spark/Scala?

The code for writing is:

val spark_cassandra = cassandra_session()
cassandra_write(spark_cassandra, joined_df_unique, "joined_df_unique", "testkeyspace")

def cassandra_write(spark_cassandra: SparkSession, df: DataFrame, df_name: String, keyspace: String): Unit = {
    import com.datastax.spark.connector._
    import org.apache.spark.sql.cassandra._
    import org.apache.spark.sql.SaveMode

    // Create the target table from the DataFrame's schema, then append the rows.
    df.createCassandraTable(keyspace, df_name)
    df.write.cassandraFormat(df_name, keyspace).mode(SaveMode.Append).save()
  }

def cassandra_session()  :  SparkSession = {

    val spark_cassandra = org.apache.spark.sql.SparkSession
      .builder()
      .master("local[*]")
      .config("spark.cassandra.connection.host", "localhost")
      .appName("Spark Cassandra Connector Example")
      .getOrCreate()

    spark_cassandra
  }

 // ("com.datastax.spark" %% "spark-cassandra-connector" % "2.4.3")

Sorry if this is too basic; it is my first time storing from Spark/Scala to Cassandra. Thanks in advance.

Upvotes: 1

Views: 883

Answers (1)

Levi Ramsey

Reputation: 20551

Cassandra stores data on disk as immutable SSTables (each SSTable consists of a few files). The immutability of SSTables solves certain problems inherent to distributed systems, which I won't go into here.

The consequence of immutability is that when you update or delete a value, you just write the new value (or in the case of a deletion, you write a tombstone which essentially says "this value was deleted at such-and-such time"). UPDATE is essentially another INSERT and DELETE is just a really-special INSERT.

  • At time 0, insert value 1 for key "A" => an SSTable containing the timestamp-0 record associating 1 with "A" is written to disk
  • At some later time n (n > 0), update key "A" with value 2 => an SSTable containing the timestamp-n record associating 2 with "A" is written to disk (the previous SSTable associating 1 with "A" at time 0 remains on disk)
  • After time n, a read of the value for "A" will scan the SSTables, see both the values 1 and 2 associated with "A", and choose the later one, i.e. the value 2
  • At some later time m (m > n > 0), delete key "A" => an SSTable containing the timestamp-m tombstone for "A" is written to disk (the two previous SSTables remain)
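The timeline above can be sketched as a toy model (the `Record`/`Write`/`Tombstone` types and the `read` function here are illustrative stand-ins, not Cassandra internals): every write lands as a new immutable record, and a read reconciles all of them by timestamp.

```scala
// Toy model of reads over immutable SSTables (illustrative names, not Cassandra internals).
object SSTableReadSketch {
  sealed trait Record { def key: String; def ts: Long }
  case class Write(key: String, ts: Long, value: Int) extends Record
  case class Tombstone(key: String, ts: Long) extends Record

  // Each inner Seq stands in for one immutable SSTable on disk.
  val sstables: Seq[Seq[Record]] = Seq(
    Seq(Write("A", ts = 0, value = 1)), // INSERT at time 0
    Seq(Write("A", ts = 5, value = 2)), // UPDATE at time 5: a brand-new record; the old one stays on disk
    Seq(Tombstone("A", ts = 9))         // DELETE at time 9: just a tombstone record
  )

  // A read scans all SSTables for the key and keeps only the latest record.
  def read(key: String): Option[Int] =
    sstables.flatten.filter(_.key == key).sortBy(_.ts).lastOption match {
      case Some(Write(_, _, v)) => Some(v) // latest record is a live value
      case _                    => None    // latest record is a tombstone, or key never written
    }

  def main(args: Array[String]): Unit = {
    println(read("A")) // the time-9 tombstone wins over both earlier writes
  }
}
```

Note that all three records for "A" remain on disk until compaction runs, which is why disk usage grows with every update and delete even though the logical data set shrinks.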

This is somewhat simplified, but the upshot is that if all the INSERTs consumed x bytes of disk, after running y UPDATE or DELETE queries, your total disk consumption might not be much less than (1 + y) * x.

There is a compaction process within Cassandra which, in our scenario, would eventually combine the three SSTables with values for "A" (including the tombstone) into a single SSTable holding only the last value (i.e. the tombstone) for "A", and after that eventually remove any trace of "A" from the SSTables. (Note that in a cluster, it's not unheard of for a tombstone to fail to propagate all the way around the cluster, resulting in deleted data being resurrected as a "zombie".) Depending on the compaction strategy in use and the volume of writes, a lot of extra disk space may be consumed before any space is reclaimed; there are even compaction strategies that may never reclaim space (an example is TimeWindowCompactionStrategy, common in the time-series use-case).
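As a rough sketch of that process (again a toy model; the `gcGraceExpired` predicate is a made-up stand-in for Cassandra's gc_grace_seconds check), compaction merges SSTables, keeps only the newest record per key, and purges a tombstone only once the grace period has passed:

```scala
// Toy sketch of compaction: merge SSTables, keep the newest record per key,
// and drop tombstones only when the (made-up) grace predicate allows it.
object CompactionSketch {
  // value = None models a tombstone.
  case class Record(key: String, ts: Long, value: Option[Int])

  def compact(sstables: Seq[Seq[Record]],
              gcGraceExpired: Long => Boolean): Seq[Record] =
    sstables.flatten
      .groupBy(_.key).values
      .map(_.maxBy(_.ts))                                      // the newest record per key wins
      .filterNot(r => r.value.isEmpty && gcGraceExpired(r.ts)) // purge expired tombstones
      .toSeq

  def main(args: Array[String]): Unit = {
    val tables = Seq(
      Seq(Record("A", 0, Some(1))),
      Seq(Record("A", 5, Some(2))),
      Seq(Record("A", 9, None)) // tombstone
    )
    // Before the grace period: three records shrink to one, but the tombstone survives.
    assert(compact(tables, _ => false) == Seq(Record("A", 9, None)))
    // After the grace period: every trace of "A" is gone and its space can be reclaimed.
    assert(compact(tables, _ => true).isEmpty)
  }
}
```

In a real cluster, when and how aggressively this merge happens is governed by the table's configured compaction strategy, which is why disk space can come back long after the deletes were issued.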

It's worth noting that a read which hits too many tombstones (the default threshold, IIRC, is 100k) will fail to return any data; this should be another consideration with a DELETE-heavy workload.

If you're repeatedly updating/deleting the same keys, your disk consumption will grow without bound unless compaction is able to keep up with your writes.

Upvotes: 4
