Reputation: 11
I am using Spark 2.4.7 and I have implemented the normal PySpark Cassandra connector, but there is a use case where I need to implement a key-based connector. I am not finding useful blogs/tutorials around it, so can someone please help me with it?
I have tried the normal pyspark-cassandra connector and it is working well. Now I want to implement a key-based connector, which I am unable to find anything about.
Normally Cassandra loads the entire table, but I do not want to load the entire table; I want to run a query on the source and fetch only the required data.
By key-based I mean getting data using some keys, i.e. using a where condition like
Select *
From <table_name>
Where <column_name> != 0
should run on the source and load only the data that satisfies this condition.
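Roughly, this is what I am after (a sketch using the Spark Cassandra Connector DataFrame API; keyspace, table and column names are placeholders, and the connector package is assumed to be on the classpath, e.g. via --packages):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cassandra-key-based-read").getOrCreate()

df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(table="my_table", keyspace="my_ks")
      .load())

# I want only the matching rows to be read from Cassandra,
# not the whole table pulled in and filtered afterwards in Spark
filtered = df.filter("my_column != 0")
filtered.show()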
Upvotes: 1
Views: 82
Reputation: 87299
To have this functionality you need to understand how both Spark & Cassandra work, separately & together.

When you call spark.read, Spark doesn't load all the data - it just fetches metadata: the table structure, column names & types, partitioning schema, etc.
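For example, a minimal sketch (keyspace and table names are placeholders; it assumes the Spark Cassandra Connector is already on the classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cassandra-read").getOrCreate()

# No rows are read from Cassandra here - only table metadata is fetched
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(table="my_table", keyspace="my_ks")
      .load())

df.printSchema()   # works without scanning the table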
When you apply a filter (where or filter), the Spark Cassandra Connector tries to perform so-called predicate pushdown - converting the Spark SQL condition into a corresponding CQL query - but whether that works really depends on the condition. If it's not possible, Spark goes through all the data and performs the filtering on the Spark side. For example, if you have a condition on a column that is the partition key, it will be converted into the CQL expression SELECT ... FROM table WHERE pk = XXX. Similarly, there are some optimizations for queries on the clustering columns - Spark will still need to go through all partitions, but it will be more efficient because it can filter data based on the clustering columns. The connector's documentation on predicate pushdown explains which conditions can be pushed down into Cassandra and which can't. The rule of thumb is: if you can execute the query in cqlsh without ALLOW FILTERING, then it will be pushed down.
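You can check what actually gets pushed down with explain(). Continuing the sketch above and assuming pk is the partition key (name is a placeholder):

# Equality on the partition key is converted to CQL and executed by Cassandra
df.filter("pk = 10").explain()
# Pushed conditions show up under PushedFilters on the Cassandra scan node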
In your specific example, you're using an inequality predicate (<> or !=), which isn't supported by Cassandra, so the Spark Cassandra Connector will need to go through all the data, and the filtering will happen on the Spark side.
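For instance (again a sketch with a placeholder column name):

# Inequality can't be served by Cassandra: Spark reads the data and
# applies the condition itself, visible as a Filter step in the plan
df.filter("my_column != 0").explain()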
Upvotes: 1