Reputation: 435
I am a total beginner with Cassandra. I have researched a bit about how Cassandra works (https://www.scnsoft.com/blog/cassandra-performance), but I have run into a situation.
I have two CSV files totaling 384 MB and a Windows 10 virtual machine with almost 10 GB of free storage. My objective is to store the 384 MB of CSV data (7,496,735 rows) in a single table in Cassandra, using Spark/Scala from IntelliJ (everything on the same single-node virtual machine). I expected this to consume something like 200-400 MB of storage, but the reality was quite different: it consumed all 10 GB of disk before failing due to lack of disk space. I thought "this must be the replication factor", but it can't be, as the keyspace was created like this:
CREATE KEYSPACE IF NOT EXISTS testkeyspace WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 } AND DURABLE_WRITES = true ;
When counting the rows stored (which took forever, with the console running several operations on its own), it turned out that only 1,767,450 rows had been saved.
The following day I realized that it had "freed" 6.38 GB of disk.
My questions are:
Why did Cassandra need so much available disk space for so little data (first 10 GB, and later 3.5 GB, for less than 0.5 GB of raw data)?
Why did it later free disk space (the 6.38 GB that was supposedly in use)?
And finally, how can I successfully store the CSV data in Cassandra from Spark/Scala?
The code for writing is:
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

val spark_cassandra = cassandra_session()
cassandra_write(spark_cassandra, joined_df_unique, "joined_df_unique", "testkeyspace")

def cassandra_write(spark_cassandra: SparkSession, df: DataFrame, df_name: String, keyspace: String): Unit = {
  import com.datastax.spark.connector._
  import com.datastax.spark.connector.cql.CassandraConnector
  import org.apache.spark.sql.cassandra._

  val sparkContext = spark_cassandra.sparkContext
  val connector = CassandraConnector(sparkContext.getConf) // not used below, kept for reference

  // Create the target table from the DataFrame schema, then append the rows
  df.createCassandraTable(keyspace, df_name) //, writeConf = writeConf)
  df.write.cassandraFormat(df_name, keyspace).mode(SaveMode.Append).save()
}

def cassandra_session(): SparkSession = {
  val spark_cassandra = org.apache.spark.sql.SparkSession
    .builder()
    .master("local[*]")
    .config("spark.cassandra.connection.host", "localhost")
    .appName("Spark Cassandra Connector Example")
    .getOrCreate()
  spark_cassandra
}

// build.sbt dependency: "com.datastax.spark" %% "spark-cassandra-connector" % "2.4.3"
Sorry if this is too basic; it is my first time storing data from Spark/Scala to Cassandra. Thanks in advance.
Upvotes: 1
Views: 883
Reputation: 20551
Cassandra stores data on disk as immutable SSTables (each SSTable consists of a few files). The immutability of SSTables solves certain problems inherent to distributed systems, which I won't go into here.
The consequence of immutability is that when you update or delete a value, you just write the new value (or in the case of a deletion, you write a tombstone which essentially says "this value was deleted at such-and-such time"). UPDATE is essentially another INSERT and DELETE is just a really-special INSERT.
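To make this concrete, here is a minimal sketch (my own illustration, not code from the question) that reuses the CassandraConnector and SparkSession already present in the question's code, against a hypothetical testkeyspace.kv table. Every statement below, including the UPDATE and the DELETE, lands as a new write rather than modifying data that has already been flushed to an SSTable:

// Hedged sketch: a hypothetical key/value table, written to via the same
// CassandraConnector the question imports. Nothing here is edited in place;
// the UPDATE is another insert and the DELETE just writes a tombstone.
import com.datastax.spark.connector.cql.CassandraConnector

val connector = CassandraConnector(spark_cassandra.sparkContext.getConf)
connector.withSessionDo { session =>
  session.execute("CREATE TABLE IF NOT EXISTS testkeyspace.kv (k text PRIMARY KEY, v int)")
  session.execute("INSERT INTO testkeyspace.kv (k, v) VALUES ('A', 1)") // first version of "A"
  session.execute("UPDATE testkeyspace.kv SET v = 2 WHERE k = 'A'")     // really another insert
  session.execute("DELETE FROM testkeyspace.kv WHERE k = 'A'")          // really an insert of a tombstone
}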
This is somewhat simplified, but the upshot is that if all the INSERTs consumed x bytes of disk, then after running y UPDATE or DELETE queries against those same values your total disk consumption might not be much less than (1 + y) * x. As a concrete scenario: INSERT a value for key "A", later UPDATE it, and finally DELETE it, and you can end up with three SSTables, each holding a version of "A" (the last of them being the tombstone).
There is a compaction process within Cassandra which, in our scenario, would eventually combine the three SSTables with values for "A" (including the tombstone) into a single SSTable containing only the last value (i.e. the tombstone) for "A", and after that would eventually remove any trace of "A" from the SSTables. (Note that in a cluster, it's not unheard of for the tombstone to fail to propagate all the way around the cluster, resulting in data which was deleted being resurrected as a "zombie".)
Depending on the compaction strategy in use and the volume of writes, a lot of extra disk space may be consumed before any space is reclaimed: there are even compaction strategies that may never reclaim space (an example is TimeWindowCompactionStrategy, common in time-series use cases).
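If you want to see which compaction strategy a table uses, or set one explicitly, something like the following works (again only a sketch, reusing the connector from the snippet above and the table name from the question; SizeTieredCompactionStrategy is Cassandra's default):

// Sketch only: inspect the compaction settings of the question's table, then
// (optionally) set a strategy explicitly. Reuses `connector` from the snippet above.
connector.withSessionDo { session =>
  val row = session.execute(
    "SELECT compaction FROM system_schema.tables " +
    "WHERE keyspace_name = 'testkeyspace' AND table_name = 'joined_df_unique'").one()
  println(row) // shows the strategy class and its options

  // Explicitly choosing the default strategy, just to show the syntax:
  session.execute(
    "ALTER TABLE testkeyspace.joined_df_unique " +
    "WITH compaction = {'class': 'SizeTieredCompactionStrategy'}")
}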
It's worth noting that a read which hits too many (default, IIRC, is 100k) tombstones will fail to return any data; this should be another consideration with a DELETE-heavy workload.
If you're repeatedly updating/deleting the same keys, your disk consumption will grow without bound unless compaction is able to keep up with your writes.
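For example, a loop like this (again just a sketch against the hypothetical kv table from earlier) keeps writing new versions of the same key; none of them overwrite the older ones in place, so until compaction merges the flushed SSTables, all of those versions occupy disk:

// Sketch: repeatedly overwriting one key. Each execute() is a fresh write;
// once flushed, the duplicate versions of 'A' sit in SSTables side by side
// until compaction merges them away.
connector.withSessionDo { session =>
  for (i <- 1 to 100000) {
    session.execute(s"INSERT INTO testkeyspace.kv (k, v) VALUES ('A', $i)")
  }
}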
Upvotes: 4