Milen Kovachev

Reputation: 5361

How to convert a Cassandra ResultSet to a Spark DataFrame?

I would normally load data from Cassandra into Apache Spark this way using Java:

SparkContext sparkContext = StorakleSparkConfig.getSparkContext();

CassandraSQLContext sqlContext = new CassandraSQLContext(sparkContext);
sqlContext.setKeyspace("midatabase");

DataFrame customers = sqlContext.cassandraSql(
    "SELECT email, first_name, last_name FROM store_customer " +
    "WHERE CAST(store_id as string) = '" + storeId + "'");

But imagine my data is sharded and I need to load several partition keys into this DataFrame. I could use WHERE IN (...) in my query and again use the cassandraSql method. But I am reluctant to use WHERE IN because of the well-known problem of the coordinator node becoming a single point of failure for the whole query. This is explained here:

https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
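For reference, the single-query variant I would rather avoid would look something like this (just a sketch; storeId1 and storeId2 stand in for my shard keys):

DataFrame customers = sqlContext.cassandraSql(
    "SELECT email, first_name, last_name FROM store_customer " +
    "WHERE CAST(store_id as string) IN ('" + storeId1 + "', '" + storeId2 + "')");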

Is there a way to use several queries but load them into a single DataFrame?

Upvotes: 3

Views: 1652

Answers (1)

Alex Naspo

Reputation: 2092

One way to do this would be to run one query per partition key and then union the resulting DataFrames with unionAll.

SparkContext sparkContext = StorakleSparkConfig.getSparkContext();

CassandraSQLContext sqlContext = new CassandraSQLContext(sparkContext);
sqlContext.setKeyspace("midatabase");

// One query per partition key, so each request only touches that key's replicas
DataFrame customersOne = sqlContext.cassandraSql(
    "SELECT email, first_name, last_name FROM store_customer " +
    "WHERE CAST(store_id as string) = '" + storeId1 + "'");

DataFrame customersTwo = sqlContext.cassandraSql(
    "SELECT email, first_name, last_name FROM store_customer " +
    "WHERE CAST(store_id as string) = '" + storeId2 + "'");

// unionAll concatenates the rows of both DataFrames
DataFrame allCustomers = customersOne.unionAll(customersTwo);
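If the number of shards is not fixed, the same pattern generalizes: run one single-partition query per key and fold the results together. A minimal sketch, assuming the sqlContext above; the storeIds list is a placeholder for your actual shard keys:

import java.util.Arrays;
import java.util.List;

List<String> storeIds = Arrays.asList(storeId1, storeId2, storeId3);

DataFrame allCustomers = null;
for (String id : storeIds) {
    // Each query targets exactly one partition key
    DataFrame current = sqlContext.cassandraSql(
        "SELECT email, first_name, last_name FROM store_customer " +
        "WHERE CAST(store_id as string) = '" + id + "'");
    // Accumulate the per-partition results into a single DataFrame
    allCustomers = (allCustomers == null) ? current : allCustomers.unionAll(current);
}

Because each query is coordinated independently, no single node has to fan out to every partition the way the coordinator of an IN query does.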

Upvotes: 1
