Reputation: 13
I want to perform a group by on each column of a DataFrame using Spark SQL. The DataFrame has approximately 1000 columns.
I have tried iterating over all the columns of the DataFrame and performing a groupBy on each one, but the program has been running for more than 1.5 hours:
val df = sqlContext
.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> "exp", "keyspace" -> "testdata"))
.load()
// groupBy + count for every column, keeping the first 10 rows of each result
val groupedData = df.columns.map(c => df.groupBy(c).count().take(10).toList)
println("Printing grouped data: " + groupedData)
If the DataFrame has, for example, the columns Name and Amount, then the output should look like this:
GroupBy on column Name:
Name Count
Jon 2
Ram 5
David 3
GroupBy on column Amount:
Amount Count
1000 4
2525 3
3000 3
I want the groupBy result for each column.
Upvotes: 1
Views: 684
Reputation: 6099
The only way I can see to speed this up is to cache the DataFrame
straight after reading it.
Unfortunately, each computation is independent and you have to perform all of them; there is no real workaround.
Something like this can speed things up a little, but not by much:
val df = sqlContext
.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> "exp", "keyspace" -> "testdata"))
.load()
.cache()
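Once the DataFrame is cached, the per-column loop from the question can be run against it unchanged; the first aggregation materializes the cache, and the remaining ones read from memory instead of Cassandra. A rough sketch of how that usage could look, reusing df and the per-column loop from the question:

// The first action populates the cache; subsequent groupBys reuse it.
df.columns.foreach { c =>
  println(s"GroupBy on column $c:")
  df.groupBy(c)
    .count()
    .show(10)   // first 10 rows per column, as in the original take(10)
}

This does not change the amount of work Spark has to do per column; it only avoids re-reading the source table for each of the ~1000 aggregations.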
Upvotes: 0