Data queried from Cassandra cannot be filtered on same column again (InvalidQueryException)

Question

I am trying to query a large chunk of data by time from cassandra, and then use spark Datasets to get smaller chunks to process at a time, however, the application fails with an invalid query exception:

WARN  2018-11-22 13:16:54 org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 2.0 (TID 5, 192.168.1.212, executor 0): java.io.IOException: Exception during preparation of SELECT "userid", "event_time", "value" FROM "user_1234"."data" WHERE token("userid") > ? AND token("userid") <= ? AND "event_time" >= ? AND "event_time" >= ? AND "event_time" <= ?   ALLOW FILTERING: More than one restriction was found for the start bound on event_time
        at com.datastax.spark.connector.rdd.CassandraTableScanRDD.createStatement(CassandraTableScanRDD.scala:323)
        at com.datastax.spark.connector.rdd.CassandraTableScanRDD.com$datastax$spark$connector$rdd$CassandraTableScanRDD$$fetchTokenRange(CassandraTableScanRDD.scala:339)
        at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$17.apply(CassandraTableScanRDD.scala:366)
        at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$17.apply(CassandraTableScanRDD.scala:366)
        at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
        at com.datastax.spark.connector.util.CountingIterator.hasNext(CountingIterator.scala:12)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: More than one restriction was found for the start bound on event_time
        at com.datastax.driver.core.exceptions.InvalidQueryException.copy(InvalidQueryException.java:41)
        at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:28)
        at com.datastax.driver.core.AbstractSession.prepare(AbstractSession.java:108)
        at com.datastax.driver.dse.DefaultDseSession.prepare(DefaultDseSession.java:278)
        at com.datastax.spark.connector.cql.PreparedStatementCache$.prepareStatement(PreparedStatementCache.scala:45)

This is the piece of code I am trying to execute:

case class RawDataModel(userid: String, event_time: Long, value: Double)
var dtRangeEnd = System.currentTimeMillis()
var dtRangeStart = (dtRangeEnd - (60 * 60 * 1000).toLong)

val queryTimeRange = "SELECT * FROM user1234.datafile WHERE event_time >= " + dtRangeStart

val dataFrame = sparkSession.sql(queryTimeRange)

import sparkSession.implicits._
val dataSet: Dataset[RawDataModel] = dataFrame.as[RawDataModel]

dataSet.show(1)



dtRangeEnd = System.currentTimeMillis()
dtRangeStart = (dtRangeEnd - (15 * 60 * 1000).toLong)

val dtRangeData = dataSet.filter(dataSet("event_time").between(dtRangeStart, dtRangeEnd))

dtRangeData.show(1)

Note: This is not a DataSets problem since I have tried to swap them with DataFrames with no difference. I thought this was a lazy evaluation problem at first with two different bounds being lazily applied at the same time but the dataSet.show(1) command should call an early aggregation and avoid cascaded evaluation

Aleksey Isachenkov · Accepted Answer

Spark merges sparkSession.sql(queryTimeRange) and dataSet.filter(dataSet("event_time").between(dtRangeStart, dtRangeEnd)) into one command which in cql looks like:

SELECT "sensorid", "event_time", "value" FROM "company_5a819ee2522e572c8a16a43a"."data" WHERE token("sensorid") > ? AND token("sensorid") <= ? AND "event_time" >= ? AND "event_time" >= ? AND "event_time" <= ?

And there you get two same restrictions on the same field "event_time" >= ?.

If you persist dataFrame before the execution of .filter Spark will compute dataFrame separately from .filter:

val dataFrame = sparkSession.sql(queryTimeRange)
dataFrame.persist
dataFrame.filter(dataSet("event_time").between(dtRangeStart, dtRangeEnd))

Data queried from Cassandra cannot be filtered on same column again (InvalidQueryException)

Answers (1)

Related Questions