Srdjan Nikitovic

Reputation: 853

Spark SQL on Cassandra table that is populated with Spark Streaming

I have a Spark Streaming process that fills a Cassandra table in real time. I want to run queries on that Cassandra table to get at the underlying data.

CQL is quite limited in its syntax (limited WHERE conditions, no GROUP BY), so I was thinking of using Spark SQL on top of it.

But once I load a data frame, it does not see any changes in the underlying data. How can I keep refreshing the data frames so that they always see the latest data?
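For reference, a minimal sketch of the setup being described, using the spark-cassandra-connector data source and the Spark 2.x API (the keyspace, table, and column names here are made up). The table is exposed as a temp view, so Spark SQL gives the WHERE/GROUP BY flexibility that CQL lacks:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("adhoc-cassandra-sql")
  .config("spark.cassandra.connection.host", "127.0.0.1")
  .getOrCreate()

// Define a DataFrame over the Cassandra table (hypothetical keyspace/table names)
val events = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "events"))
  .load()

// Expose it to Spark SQL; CQL-style restrictions on WHERE/GROUP BY no longer apply here
events.createOrReplaceTempView("events")

spark.sql(
  """SELECT user_id, count(*) AS cnt
    |FROM events
    |WHERE event_time > '2016-01-01'
    |GROUP BY user_id""".stripMargin).show()
```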

Srdjan

Upvotes: 3

Views: 273

Answers (1)

user1483833

Reputation: 71

I know this is an older post, but there seems to be a recurring theme here: the need for full-featured querying on data that has been ingested into NoSQL stores, with Spark SQL offering the ability to do that. Here are a couple of things to consider when going down that path:

1> If you work against the datastore directly using the Spark connector, then even with predicate pushdown the relevant columns have to be moved from Cassandra (or other NoSQL stores) into Spark in order to run the query. There is little point in caching the data that has been moved into Spark, because with ad-hoc querying the next query is likely to need a different set of data, so the process repeats, causing churn in the Spark process and hurting performance.
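As a hedged illustration of that first point (hypothetical table and column names, and the `spark` session from the question's setup): the equality filter below is a candidate for pushdown to Cassandra, but the rows that survive it still get shipped into Spark executors before the aggregation runs there, and `explain` shows what was actually pushed down.

```scala
import org.apache.spark.sql.functions.avg

val readings = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "readings"))
  .load()

val bySensor = readings
  .filter(readings("sensor_id") === "s-42")   // candidate for predicate pushdown to Cassandra
  .groupBy("sensor_id")
  .agg(avg("value"))                          // aggregation runs in Spark, not in Cassandra

bySensor.explain(true)   // the physical plan shows which filters were pushed to the source
bySensor.show()
```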

2> If one goes down the path of simply loading all of the data from the datastore into Spark, then you have the staleness issue mentioned above, because the data cached in Spark is immutable. One workaround is to put a TTL (time to live) on the data in Spark and drop and recreate the data frame from scratch every so often, which is wasteful and inefficient, and it's not clear what happens to queries while that is being done.
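A rough sketch of that drop-and-recreate workaround (the 60-second interval, keyspace, and table names are placeholders, and `spark` is an existing SparkSession): a cached snapshot is rebuilt on a schedule, so queries against the temp view see data that can be up to one interval stale, and what happens to queries running during the swap is exactly the gray area mentioned above.

```scala
import java.util.concurrent.{Executors, TimeUnit}

def loadSnapshot() = {
  val snapshot = spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "my_ks", "table" -> "events"))
    .load()
    .cache()                                   // materialize the current table contents in Spark
  snapshot.createOrReplaceTempView("events_snapshot")
  snapshot
}

var current = loadSnapshot()

// Every 60 seconds, rebuild the snapshot from Cassandra and drop the old cache
val scheduler = Executors.newSingleThreadScheduledExecutor()
scheduler.scheduleAtFixedRate(new Runnable {
  def run(): Unit = {
    val fresh = loadSnapshot()   // registers the new view first
    current.unpersist()          // then releases the stale cache
    current = fresh
  }
}, 60, 60, TimeUnit.SECONDS)

// Ad-hoc queries go against the snapshot view and see data at most one interval old
spark.sql("SELECT count(*) FROM events_snapshot").show()
```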

A best-of-breed solution (SnappyData is one I know) simply turns the data frames into mutable entities, so that changes to data in the NoSQL store can be CDC'd into Spark and you can execute queries using Spark SQL without leaving the Spark cluster or having to move data into Spark for each query. This has significant performance benefits (data can be stored in columnar format, queries can be pruned, unnecessary serialization costs are avoided, and Spark's code generation runs queries faster), reduces overall system complexity, and lets you build continuous applications that work with the latest data.

Upvotes: 1
