Reputation: 163
I have a table with lots of columns, for ex. test_event and also I have another table test in the same keyspace that contains id's of rows I have to delete from test_event.
I tried deleteFromCassandra, but it doesn't works because spark cannot see SparkContext. I found some solutions used DELETE FROM, but it was written in scala.
After about hundred attempts I finally get confused and asked for your help. Can somebody do it with me step by step?
Upvotes: 1
Views: 1940
Reputation: 87164
Spark Cassandra Connector (SCC) itself provides only Dataframe API for Python. But there is a pyspark-cassandra package that provides RDD API on top of the SCC, so deletion could be performed as following.
Start pyspark shell with (I've tried with Spark 2.4.3):
bin/pyspark --conf spark.cassandra.connection.host=IPs\
--packages anguenot:pyspark-cassandra:2.4.0
and inside read data from one table, and do delete. You need to have source data to have the columns corresponding to the primary key. It could be full primary key, partial primary key, or only partition key - depending on it, Cassandra will use corresponding tombstone type (row/range/partition tombstone).
In my example, table has primary key consisting of one column - that's why I specified only one element in the array:
rdd = sc.cassandraTable("test", "m1")
rdd.deleteFromCassandra("test","m1", keyColumns = ["id"])
Upvotes: 1
Reputation: 377
Take a look on this code:
from pyspark.sql import SQLContext
def main_function():
sql = SQLContext(sc)
tests = sql.read.format("org.apache.spark.sql.cassandra").\
load(keyspace="your keyspace", table="test").where(...)
for test in tests:
delete_sql = "delete from test_event where id = " + test.select('id')
sql.execute(delete_sql)
Be aware of deleting one row at a time is not a best practice on spark but the above code is just an example to help you figure out your implementation.
Upvotes: 1