Reputation: 5361
I would normally load data from Cassandra into Apache Spark this way using Java:
import org.apache.spark.SparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.cassandra.CassandraSQLContext;

SparkContext sparkContext = StorakleSparkConfig.getSparkContext();
CassandraSQLContext sqlContext = new CassandraSQLContext(sparkContext);
sqlContext.setKeyspace("midatabase");
DataFrame customers = sqlContext.cassandraSql("SELECT email, first_name, last_name FROM store_customer " +
    "WHERE CAST(store_id as string) = '" + storeId + "'");
But imagine I have a sharder and I need to load several partition keys into this DataFrame. I could use WHERE IN (...) in my query (sketched below) and again use the cassandraSql method, but I am reluctant to do that because of the well-known problem that a multi-partition IN query routes the entire result through a single coordinator node, making it a single point of failure. This is explained here:
Is there a way to use several queries but load them into a single DataFrame?
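For reference, the single-query WHERE IN variant I would rather avoid looks roughly like this (a sketch; the hard-coded store IDs are just placeholders):

DataFrame customers = sqlContext.cassandraSql(
    "SELECT email, first_name, last_name FROM store_customer " +
    "WHERE CAST(store_id as string) IN ('1', '2', '3')");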
Upvotes: 3
Views: 1652
Reputation: 2092
One way to do this is to run the individual queries and combine the resulting DataFrames/RDDs with unionAll/union.
SparkContext sparkContext = StorakleSparkConfig.getSparkContext();
CassandraSQLContext sqlContext = new CassandraSQLContext(sparkContext);
sqlContext.setKeyspace("midatabase");
DataFrame customersOne = sqlContext.cassandraSql("SELECT email, first_name, last_name FROM store_customer " +
    "WHERE CAST(store_id as string) = '" + storeId1 + "'");
DataFrame customersTwo = sqlContext.cassandraSql("SELECT email, first_name, last_name FROM store_customer " +
    "WHERE CAST(store_id as string) = '" + storeId2 + "'");
DataFrame allCustomers = customersOne.unionAll(customersTwo);
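If the sharder yields an arbitrary number of partition keys, the same idea generalizes to a loop that folds the per-key DataFrames together. A sketch, assuming a hypothetical storeIds list (StorakleSparkConfig, sqlContext, and the schema are from the question):

import java.util.Arrays;
import java.util.List;

List<String> storeIds = Arrays.asList("1", "17", "42"); // hypothetical shard keys

DataFrame allCustomers = null;
for (String storeId : storeIds) {
    // One single-partition query per key, so each request goes to the
    // replicas owning that partition instead of funnelling everything
    // through one coordinator.
    DataFrame current = sqlContext.cassandraSql(
        "SELECT email, first_name, last_name FROM store_customer " +
        "WHERE CAST(store_id as string) = '" + storeId + "'");
    allCustomers = (allCustomers == null) ? current : allCustomers.unionAll(current);
}

Note that unionAll is cheap here: it only combines the query plans, so the per-key Cassandra scans can still execute as independent tasks.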
Upvotes: 1