Reputation: 4663
I want to get a listing of values and counts for a specific column (column "a") in a Cassandra table using Datastax and Spark, but I'm having trouble determining the correct method of performing that request.
I'm essentially trying to do the equivalent of this T-SQL:
SELECT a, COUNT(a)
FROM mytable
GROUP BY a
I've tried the following using datastax and spark on Cassandra
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
val rdd = sc.cassandraTable("mykeyspace", "mytable").select("a")
rdd.groupBy(row => row.getString("a")).count()
This appears to give me only the number of distinct values in the a column, but what I'm after is a listing of the values together with their counts (e.g. val1:10, val2:5, val3:12, and so forth). I've tried .collect and similar, but I can't work out how to get that listing; any help would be appreciated.
Upvotes: 1
Views: 766
Reputation: 2949
The code snippet below joins on the partition key named "a", reads the column "column_name", and counts the occurrences of each value in that column.
val cassandraPartitionKeys = List("a")
val partitionKeyRdd = sc.parallelize(cassandraPartitionKeys)
val cassandraRdd = partitionKeyRdd.joinWithCassandraTable(keyspace,table).map(x => x._2)
cassandraRdd.map(row => (row.getString("column_name"), 1)).countByKey().foreach(println)
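Note that countByKey() is an action that returns a Map[K, Long] directly on the driver, so no further collect step is needed. The aggregation it performs can be sketched in plain Scala, using hypothetical sample values standing in for the "column_name" rows:

```scala
// Hypothetical sample data in place of the Cassandra rows.
val values = List("val1", "val1", "val2", "val1")

// Group identical values and count each group, mirroring what
// countByKey() does across the cluster before returning its Map.
val counts: Map[String, Long] =
  values.groupBy(identity).map { case (k, vs) => (k, vs.size.toLong) }

counts.foreach(println)
```

Because the whole result map is materialised on the driver, this is only appropriate when the number of distinct values is small.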
Upvotes: 1
Reputation: 4663
This seems to be a partial answer (it produces the correct data, but there is likely a better solution):
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
val rdd = sc.cassandraTable("mykeyspace", "mytable").groupBy(row => row.getString("a"))
rdd.foreach { case (value, rows) => println(value + " " + rows.size) }
I'm assuming there is a better solution, but this looks to work in terms of getting results.
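The groupBy approach shuffles every row for a value to one executor; the usual lighter-weight alternative is the (value, 1) / reduceByKey pattern, which combines partial counts before shuffling. A sketch, with the Spark lines (assuming the same sc, keyspace, and table as above) shown as comments and the per-value fold reproduced on a local collection of hypothetical values:

```scala
// Spark version of the same idea (assumes the sc and table from the question):
//   sc.cassandraTable("mykeyspace", "mytable")
//     .map(row => (row.getString("a"), 1))
//     .reduceByKey(_ + _)
//     .collect()
//     .foreach { case (value, count) => println(s"$value:$count") }

// The reduceByKey step is just a per-key sum; locally the same fold is:
val rows = List("val1", "val2", "val1", "val3", "val1")  // hypothetical column values
val counts = rows.foldLeft(Map.empty[String, Int]) { (m, v) =>
  m.updated(v, m.getOrElse(v, 0) + 1)
}
counts.foreach { case (value, count) => println(s"$value:$count") }
```

Unlike groupBy, reduceByKey never materialises the full list of rows per value, only the running counts, so it scales to large tables.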
Upvotes: 0