Israel Rodriguez

Reputation: 435

Why does Cassandra need so much available disk space for so little data?

I am a total beginner with Cassandra. I have researched a bit about how Cassandra works (https://www.scnsoft.com/blog/cassandra-performance) but I have run into a situation.

I have two CSVs totalling 384 MB and a Win10 virtual machine with almost 10 GB of free storage. My objective is to store the 384 MB of CSV data (7,496,735 rows) in a single table in Cassandra, using Spark/Scala from IntelliJ (everything on the same single-node virtual machine). I expected this to consume something like 200-400 MB of storage, but the reality was quite different: it consumed all 10 GB of disk before failing due to lack of disk space. I thought "this must be the replication factor", but it can't be, as the keyspace was created like this:

CREATE KEYSPACE IF NOT EXISTS testkeyspace WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 } AND DURABLE_WRITES = true ;

When counting the rows stored (it took forever, doing several operations on the console by itself), it turned out it had managed to save 1,767,450 rows.


The following day I realized that it had "freed" 6.38 GB of disk.

My questions are:

Why did Cassandra need so much available disk space for such a small amount of data (first 10 GB, and later 3.5 GB, for less than 0.5 GB of raw data)?

Why did it later free disk space (the 6.38 GB that was supposedly in use)?

And finally, how can I successfully store the CSV data in Cassandra from Spark/Scala?

The code for writing is:

val spark_cassandra = cassandra_session()
cassandra_write(spark_cassandra, joined_df_unique, "joined_df_unique", "testkeyspace")

def cassandra_write(spark_cassandra: SparkSession, df: DataFrame, df_name: String, keyspace: String): Unit = {
    import com.datastax.spark.connector._
    import org.apache.spark.sql.cassandra._
    import org.apache.spark.sql.SaveMode

    // Create the target table from the DataFrame's schema, then append the rows.
    df.createCassandraTable(keyspace, df_name)
    df.write.cassandraFormat(df_name, keyspace).mode(SaveMode.Append).save()
  }

def cassandra_session()  :  SparkSession = {

    val spark_cassandra = org.apache.spark.sql.SparkSession
      .builder()
      .master("local[*]")
      .config("spark.cassandra.connection.host", "localhost")
      .appName("Spark Cassandra Connector Example")
      .getOrCreate()

    spark_cassandra
  }

 // ("com.datastax.spark" %% "spark-cassandra-connector" % "2.4.3")

Sorry if this is too basic; it is my first time storing from Spark/Scala to Cassandra. Thanks in advance.

Upvotes: 1

Views: 883

Answers (1)

Levi Ramsey

Reputation: 20551

Cassandra stores data on disk as immutable SSTables (each SSTable consists of a few files). The immutability of SSTables solves certain problems inherent to distributed systems, which I won't go into here.

The consequence of immutability is that when you update or delete a value, you just write the new value (or in the case of a deletion, you write a tombstone which essentially says "this value was deleted at such-and-such time"). UPDATE is essentially another INSERT and DELETE is just a really-special INSERT.

  • At time 0, insert value 1 for key "A" => an SSTable containing the timestamp-0 record associating 1 with "A" is written to disk
  • At some later time n (n > 0), update key "A" with value 2 => an SSTable containing the timestamp-n record associating 2 with "A" is written to disk (the previous SSTable associating 1 with "A" at time 0 remains on disk)
  • After time n, a read of the value for "A" will scan the SSTables, see both the values 1 and 2 associated with "A", and choose the later one, i.e. the value 2
  • At some later time m (m > n > 0), delete key "A" => an SSTable containing the timestamp-m tombstone for "A" is written to disk (the two previous SSTables remain)
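The timeline above can be sketched as a toy model (the `Record`/`Write`/`Tombstone` types and the `read` function here are illustrative stand-ins, not Cassandra internals): every write lands as a new immutable record, and a read reconciles all of them by timestamp.

```scala
// Toy model of reads over immutable SSTables (illustrative names, not Cassandra internals).
object SSTableReadSketch {
  sealed trait Record { def key: String; def ts: Long }
  case class Write(key: String, ts: Long, value: Int) extends Record
  case class Tombstone(key: String, ts: Long) extends Record

  // Each inner Seq stands in for one immutable SSTable on disk.
  val sstables: Seq[Seq[Record]] = Seq(
    Seq(Write("A", ts = 0, value = 1)), // INSERT at time 0
    Seq(Write("A", ts = 5, value = 2)), // UPDATE at time 5: a brand-new record; the old one stays on disk
    Seq(Tombstone("A", ts = 9))         // DELETE at time 9: just a tombstone record
  )

  // A read scans all SSTables for the key and keeps only the latest record.
  def read(key: String): Option[Int] =
    sstables.flatten.filter(_.key == key).sortBy(_.ts).lastOption match {
      case Some(Write(_, _, v)) => Some(v) // latest record is a live value
      case _                    => None    // latest record is a tombstone, or key never written
    }

  def main(args: Array[String]): Unit = {
    println(read("A")) // the time-9 tombstone wins over both earlier writes
  }
}
```

Note that all three records for "A" remain on disk until compaction runs, which is why disk usage grows with every update and delete even though the logical data set shrinks.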

This is somewhat simplified, but the upshot is that if all the INSERTs consumed x bytes of disk, after running y UPDATE or DELETE queries, your total disk consumption might not be much less than (1 + y) * x.

There is a compaction process within Cassandra which, in our scenario, would eventually combine the three SSTables with values for "A" (including the tombstone) into a single SSTable holding only the last value (i.e. the tombstone) for "A", and after that eventually remove any trace of "A" from the SSTables. (Note that in a cluster, it's not unheard of for a tombstone to fail to propagate all the way around the cluster, resulting in deleted data being resurrected as a "zombie".) Depending on the compaction strategy in use and the volume of writes, a lot of extra disk space may be consumed before any space is reclaimed; there are even compaction strategies that may never reclaim space (an example is TimeWindowCompactionStrategy, common in the time-series use-case).
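As a rough sketch of that process (again a toy model; the `gcGraceExpired` predicate is a made-up stand-in for Cassandra's gc_grace_seconds check), compaction merges SSTables, keeps only the newest record per key, and purges a tombstone only once the grace period has passed:

```scala
// Toy sketch of compaction: merge SSTables, keep the newest record per key,
// and drop tombstones only when the (made-up) grace predicate allows it.
object CompactionSketch {
  // value = None models a tombstone.
  case class Record(key: String, ts: Long, value: Option[Int])

  def compact(sstables: Seq[Seq[Record]],
              gcGraceExpired: Long => Boolean): Seq[Record] =
    sstables.flatten
      .groupBy(_.key).values
      .map(_.maxBy(_.ts))                                      // the newest record per key wins
      .filterNot(r => r.value.isEmpty && gcGraceExpired(r.ts)) // purge expired tombstones
      .toSeq

  def main(args: Array[String]): Unit = {
    val tables = Seq(
      Seq(Record("A", 0, Some(1))),
      Seq(Record("A", 5, Some(2))),
      Seq(Record("A", 9, None)) // tombstone
    )
    // Before the grace period: three records shrink to one, but the tombstone survives.
    assert(compact(tables, _ => false) == Seq(Record("A", 9, None)))
    // After the grace period: every trace of "A" is gone and its space can be reclaimed.
    assert(compact(tables, _ => true).isEmpty)
  }
}
```

In a real cluster, when and how aggressively this merge happens is governed by the table's configured compaction strategy, which is why disk space can come back long after the deletes were issued.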

It's worth noting that a read which hits too many tombstones (the default threshold, IIRC, is 100k) will fail to return any data; this should be another consideration with a DELETE-heavy workload.

If you're repeatedly updating/deleting the same keys, your disk consumption will grow without bound unless compaction is able to keep up with your writes.

Upvotes: 4
