Reputation: 13
I want to perform a group by on each column of a DataFrame using Spark SQL. The DataFrame has approximately 1000 columns.
I have tried iterating over all the columns of the DataFrame and performing a groupBy on each one, but the program has been running for more than 1.5 hours:
val df = sqlContext
.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> "exp", "keyspace" -> "testdata"))
.load()
// groupBy + count for every column, keeping the first 10 rows of each result
val groupedData = df.columns.map(c => df.groupBy(c).count().take(10).toList)
println("Printing grouped data: " + groupedData)
If the DataFrame has, for example, the columns Name and Amount, then the output should look like this:
GroupBy on column Name:
Name Count
Jon 2
Ram 5
David 3
GroupBy on column Amount:
Amount Count
1000 4
2525 3
3000 3
I want the groupBy result for each column.
Upvotes: 1
Views: 684
Reputation: 6099
The only way I can see to speed this up is to cache the DataFrame
straight after reading it.
Unfortunately, each computation is independent and you have to perform all of them; there is no real workaround.
Something like this can speed things up a little, but not by much:
val df = sqlContext
.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> "exp", "keyspace" -> "testdata"))
.load()
.cache()
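Once the DataFrame is cached, the per-column loop from the question can be run against it unchanged; the first aggregation materializes the cache, and the remaining ones read from memory instead of Cassandra. A rough sketch of how that usage could look, reusing df and the per-column loop from the question:

// The first action populates the cache; subsequent groupBys reuse it.
df.columns.foreach { c =>
  println(s"GroupBy on column $c:")
  df.groupBy(c)
    .count()
    .show(10)   // first 10 rows per column, as in the original take(10)
}

This does not change the amount of work Spark has to do per column; it only avoids re-reading the source table for each of the ~1000 aggregations.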
Upvotes: 0