Sourav Gulati

Reputation: 1359

Error while running a range query on multiple clustering columns using the Spark Cassandra connector

Following is the Cassandra table schema:

CREATE TABLE my_table (
    year text,
    month text,
    day text,
    hour int,
    min int,
    sec int,
    PRIMARY KEY ((year, month, day), hour, min, sec)
)

If I run the following query using Cassandra CQL, it works:

SELECT * FROM my_table WHERE year ='2017' and month ='01' and day ='16' and (hour,min,sec) > (1,15,0) LIMIT 200

However, when I run the same query using the Spark Cassandra connector, it does not work:

sparkSession.read().format("org.apache.spark.sql.cassandra").options(map).load()
        .where("year = '2017' and month = '01' and day = '16' and (hour,min,sec) >= (1,15,0)");

I am getting the following exception in the logs:

> Exception in thread "main" org.apache.spark.sql.AnalysisException:
> cannot resolve '(struct(`hour`, `min`, `sec`) >= struct(1, 15, 0))'
> due to data type mismatch: differing types in '(struct(`hour`, `min`,
> `sec`) >= struct(1, 15, 0))'  and (struct<hour:int,min:int,sec:int>
> struct<col1:int,col2:int,col3:int>).; line 1 pos 96

Spark Cassandra connector version: 2.0.0-M3

Spark version: 2.0.0

Any help is much appreciated.

Upvotes: 0

Views: 190

Answers (1)

RussS

Reputation: 16576

Quite simply, CQL is not Spark SQL or Catalyst compatible. What you are seeing is a conflict in syntax.

This where clause:

.where("year = '2017' and month = '01' and day = '16' and (hour,min,sec) >= (1,15,0)")

is not pushed down directly to Cassandra. Instead, it is transformed into Catalyst predicates, and this is where you have a problem.

Catalyst sees this:

(hour,min,sec) >= (1,15,0)

and tries to assign a type to each side.

The left-hand side becomes:

struct<hour:int,min:int,sec:int>

The right-hand side becomes:

struct<col1:int,col2:int,col3:int>

These are not tuples, but explicitly typed structs. Because their field names differ, they cannot be compared directly, hence your error. In the DataFrame API you would define a new struct with the correct field names and types and compare against a literal of that, but I'm not sure how to express it in Spark SQL.
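A minimal sketch of that DataFrame approach in Java, assuming the table has already been loaded into a DataFrame named df (my name, not from the question); aliasing the literals to the same field names makes both sides type as struct<hour:int,min:int,sec:int>. I have not verified this against these exact versions:

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Build a struct from the real columns and compare it against a literal
// struct whose fields are aliased to the same names, so the types match.
// df is assumed to be the DataFrame loaded in the question.
Dataset<Row> filtered = df.where(
        struct(col("hour"), col("min"), col("sec"))
                .geq(struct(lit(1).alias("hour"),
                            lit(15).alias("min"),
                            lit(0).alias("sec"))));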

Regardless, this tuple predicate will not be pushed down to Cassandra. The struct you are defining from hour, min, sec is hidden from Cassandra, because the underlying table does not provide a struct<hour, min, sec>; Spark therefore has to generate it after pulling the data from Cassandra.
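You can confirm this by printing the physical plan, which shows which filters reach the Cassandra scan as PushedFilters and which remain as a Spark-side Filter step (a quick check, reusing the filtered Dataset from the sketch above):

// The partition-key equality filters appear under PushedFilters on the
// Cassandra scan; the struct comparison shows up as a separate Filter
// node above it, i.e. evaluated by Spark after the data is pulled.
filtered.explain();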

You are better off just using separate clauses combined with AND, as mentioned by @AkashSethi. Something along these lines is my reading of that suggestion, since the comment itself is not shown here:
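// Hypothetical rewrite with separate clauses; note this filters each
// column independently rather than comparing (hour,min,sec)
// lexicographically, so it also excludes rows like hour = 2, min = 0
// that the tuple comparison would keep.
sparkSession.read().format("org.apache.spark.sql.cassandra").options(map).load()
        .where("year = '2017' and month = '01' and day = '16' "
                + "and hour >= 1 and min >= 15 and sec >= 0");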

Upvotes: 1
