Hemanth

Reputation: 735

Summing the column values of all the rows in a Dataframe - scala/spark

I am new to Scala/Spark. I am working on a Scala/Java application on Spark, trying to read some data from a Hive table and then sum up all the column values for each row. For example, consider the following DataFrame:

+--------+-+-+-+-+-+-+
| address|a|b|c|d|e|f|
+--------+-+-+-+-+-+-+
|Newyork |1|0|1|0|1|1|
|   LA   |0|1|1|1|0|1|
|Chicago |1|1|0|0|1|1|
+--------+-+-+-+-+-+-+

I want to sum up all the 1's in all the rows and get the total, i.e. the sum of all the columns of the above DataFrame should be 12 (since there are twelve 1's across all the rows combined).

I tried doing this:

var count = 0
DF.foreach { x =>
  count = count +
    Integer.parseInt(x.getAs[String]("a")) +
    Integer.parseInt(x.getAs[String]("b")) +
    Integer.parseInt(x.getAs[String]("c")) +
    Integer.parseInt(x.getAs[String]("d")) +
    Integer.parseInt(x.getAs[String]("e")) +
    Integer.parseInt(x.getAs[String]("f"))
}

When I run the above code, the count value is still zero. I think this has something to do with running the application on a cluster, so declaring a variable and adding to it doesn't work for me, since my application has to run on a cluster. I also tried declaring a static variable in a separate Java class and adding to it; that gives me the same result.
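(Editor's note: the counter stays zero because Spark serializes the closure and ships a copy of `count` to each executor; the driver's copy is never updated. Spark's accumulator API exists for exactly this case. A minimal sketch, assuming `sc` is the application's existing SparkContext, `df` is the DataFrame above, and the columns hold numeric strings as in the question:)

```scala
// Driver-visible counter; executors add to it, the driver reads it.
// sc.accumulator is the Spark 1.6-era API (deprecated in 2.x in
// favour of sc.longAccumulator).
val acc = sc.accumulator(0L, "ones total")

df.foreach { row =>
  // Parse each string column to a number and add the row's sum
  // to the accumulator instead of a captured local variable.
  acc += Seq("a", "b", "c", "d", "e", "f")
    .map(c => row.getAs[String](c).toLong)
    .sum
}

println(acc.value) // total across all rows, read back on the driver
```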

As far as I know, there should be an easy way to achieve this using the built-in functions available in the Spark/Scala libraries.

What would be an efficient way of achieving this? Any help would be appreciated.

Thank you.

P.S: I am using Spark 1.6.

Upvotes: 3

Views: 8686

Answers (2)

MaxU - stand with Ukraine

Reputation: 210842

Here is an alternative approach:

first let's prepare an aggregating function:

scala> import org.apache.spark.sql.functions.{col, sum}
import org.apache.spark.sql.functions.{col, sum}

scala> val f = df.drop("address").columns.map(col).reduce((c1, c2) => c1 + c2)
f: org.apache.spark.sql.Column = (((((a + b) + c) + d) + e) + f)

get sum as a DataFrame:

scala> df.agg(sum(f).alias("total")).show
+-----+
|total|
+-----+
|   12|
+-----+

get sum as a Long number:

scala> df.agg(sum(f)).first.getLong(0)
res39: Long = 12

Upvotes: 0

akuiper

Reputation: 214957

You can sum the column values first, which gives back a single-row DataFrame of sums; then you can convert this Row to a Seq and sum the values up:

import org.apache.spark.sql.functions.{col, sum}

// columns.tail skips the first column ("address")
val sum_cols = df.columns.tail.map(x => sum(col(x)))
df.agg(sum_cols.head, sum_cols.tail: _*).first.toSeq.asInstanceOf[Seq[Long]].sum
// res9: Long = 12

df.agg(sum_cols.head, sum_cols.tail: _*).show
+------+------+------+------+------+------+
|sum(a)|sum(b)|sum(c)|sum(d)|sum(e)|sum(f)|
+------+------+------+------+------+------+
|     2|     2|     2|     1|     2|     3|
+------+------+------+------+------+------+

Upvotes: 1
