ChrisHDog

Reputation: 4663

Apache Spark Count by Group Method

I want to get a listing of values and their counts for a specific column (column "a") in a Cassandra table using the DataStax Spark connector, but I'm having trouble determining the correct way to make that request.

I'm essentially trying to do the equivalent of this T-SQL:

SELECT a, COUNT(a)
FROM mytable
GROUP BY a

I've tried the following using the DataStax connector and Spark on Cassandra:

import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
val rdd = sc.cassandraTable("mykeyspace", "mytable").select("a")
// count() here returns the number of groups (i.e. distinct values of "a"),
// not the size of each group
rdd.groupBy(row => row.getString("a")).count()

This looks to give me just the count of distinct values in the "a" column, whereas I was after a listing of the values and the counts of those values. I've tried .collect and similar, but I'm not sure how to get that listing; any help would be appreciated.
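To be explicit, the output I'm after is a listing of value/count pairs, i.e. something like:

(val1,10)
(val2,5)
(val3,12)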

Upvotes: 1

Views: 766

Answers (2)

Knight71

Reputation: 2949

The code snippet below fetches the rows for the partition key value "a" (using the keyspace and table names from the question), reads the column "column_name" from each matching row, and counts how many times each value occurs.

import com.datastax.spark.connector._

val cassandraPartitionKeys = List("a")
// joinWithCassandraTable expects key tuples, so wrap each value in Tuple1
val partitionKeyRdd = sc.parallelize(cassandraPartitionKeys).map(Tuple1(_))

val cassandraRdd = partitionKeyRdd.joinWithCassandraTable("mykeyspace", "mytable").map(x => x._2)

// countByKey is an action that already returns a local Map, so no collect is needed
cassandraRdd.map(row => (row.getString("column_name"), 1)).countByKey().foreach(println)
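One caveat worth noting: countByKey returns the entire result map to the driver, so for a column with many distinct values a distributed aggregation is the safer pattern. A minimal sketch, assuming the same cassandraRdd and column name as above:

cassandraRdd.map(row => (row.getString("column_name"), 1)).reduceByKey(_ + _).collect.foreach(println)

reduceByKey does the counting on the executors and keeps the result as an RDD, so it can be saved to storage instead of collected when the listing is large.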

Upvotes: 1

ChrisHDog

Reputation: 4663

This seems like it could be a partial answer (it produces the correct data, but there is likely a better solution):

import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
val rdd = sc.cassandraTable("mykeyspace", "mytable").groupBy(row => row.getString("a"))
// collect the grouped pairs to the driver, then print each value with the
// size of its group (a bare foreach would println on the executors instead)
rdd.collect.foreach { case (value, rows) => println(value + " " + rows.size) }

I'm assuming there is a better solution, but this appears to work in terms of getting the results.
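One candidate for that better solution is to let Spark SQL do the grouping. This is a sketch, assuming a Spark 1.x shell where sqlContext is available and a connector version that exposes the Cassandra data source (the keyspace and table names are the ones from the question):

val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "mykeyspace", "table" -> "mytable"))
  .load()

// mirrors SELECT a, COUNT(a) FROM mytable GROUP BY a
df.groupBy("a").count().show()

Here the aggregation happens inside Spark SQL rather than materializing a per-key Iterable the way groupBy on an RDD does.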

Upvotes: 0
