Rahul Koshaley

Reputation: 201

How to know the number of rows inserted into Cassandra using Spark

I'm inserting into Cassandra using Spark.

import com.datastax.spark.connector.japi.CassandraJavaUtil;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

CassandraJavaUtil.javaFunctions(newRDD)
        .writerBuilder("dmp", "dmp_user_user_profile_spark1", mapToRow(UserSetGet.class))
        .saveToCassandra();
logger.info("DataSaved");

My question is: if the RDD has 5k rows and the job fails for some reason while inserting into Cassandra,

will the rows that were already inserted out of the 5k be rolled back?

And if not, how will I know how many rows were actually inserted, so that I can restart my job from the failed row?

Upvotes: 0

Views: 415

Answers (1)

Abhishek Anand

Reputation: 1992

Simple answer: no, there will not be an automatic rollback.

Whatever data Spark was able to save into Cassandra will stay persisted in Cassandra.

And no, there is no simple way to know how far into the dataset the Spark job got before it failed. In fact, the only way I can think of is to read the data back from Cassandra, join it against your input, and filter out the already-saved rows from your result set based on the key.
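Something like the following sketch (it assumes sc is your JavaSparkContext and that UserSetGet exposes its primary key through a getUserId() getter, which is a placeholder here; use your real key columns):

    import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
    import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapRowTo;

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import scala.Tuple2;

    // Keys that already made it into Cassandra before the failure.
    JavaPairRDD<String, Boolean> savedKeys = javaFunctions(sc)
            .cassandraTable("dmp", "dmp_user_user_profile_spark1", mapRowTo(UserSetGet.class))
            .mapToPair(row -> new Tuple2<String, Boolean>(row.getUserId(), Boolean.TRUE)); // getUserId() is a placeholder

    // Key the input the same way and drop everything that was already saved.
    JavaRDD<UserSetGet> remaining = newRDD
            .mapToPair(u -> new Tuple2<String, UserSetGet>(u.getUserId(), u))
            .subtractByKey(savedKeys)
            .values();

Note that this reads the whole table back, which is exactly the overhead mentioned next.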

To be honest, that is quite an overhead if the data is huge, since it means a humongous join. In most cases you can simply re-run the job on Spark and let it save to the Cassandra table again. Since inserts and updates work the same way in Cassandra (writes are upserts), re-writing the same rows won't be a problem.

The only place this can be problematic is if you are dealing with counter tables.

Update: For this specific scenario, you can split your RDD into batches of whatever size suits you and then save them one at a time. That way, if the save fails on one batch, you will know which batch failed, the earlier batches are already saved, and you can pick up again from the failed batch (see the sketch below).
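A minimal sketch of that batching idea, reusing the newRDD and writer call from the question (the batch count of 10 and the fixed seed are arbitrary choices, not anything the connector requires):

    import com.datastax.spark.connector.japi.CassandraJavaUtil;
    import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

    import org.apache.spark.api.java.JavaRDD;
    import java.util.Arrays;

    int numBatches = 10;                        // arbitrary number of batches
    double[] weights = new double[numBatches];
    Arrays.fill(weights, 1.0 / numBatches);     // equal-sized splits

    // Fixed seed so a re-run produces the same splits and "batch i" means the same rows.
    JavaRDD<UserSetGet>[] batches = newRDD.randomSplit(weights, 0L);

    for (int i = 0; i < batches.length; i++) {
        try {
            CassandraJavaUtil.javaFunctions(batches[i])
                    .writerBuilder("dmp", "dmp_user_user_profile_spark1", mapToRow(UserSetGet.class))
                    .saveToCassandra();
            logger.info("Saved batch " + i);
        } catch (Exception e) {
            // Batches 0..i-1 are already persisted; resume from batch i on the next run.
            throw new RuntimeException("Failed at batch " + i, e);
        }
    }

If newRDD is expensive to recompute, cache it before splitting so the batches are not rebuilt from scratch on each save.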

Upvotes: 1
