Reputation: 3257
I have a pyspark dataframe with a column of numbers. I need to sum that column and then have the result returned as an int in a Python variable.
df = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)],["Letter", "Number"])
I do the following to sum the column.
df.groupBy().sum()
But I get a dataframe back.
+-----------+
|sum(Number)|
+-----------+
| 130|
+-----------+
I would like 130 returned as an int, stored in a variable to be used elsewhere in the program.
result = 130
Upvotes: 67
Views: 208348
Reputation: 10891
Select the column as an RDD, abuse keys() to get the value out of each Row (or use .map(lambda x: x[0])), then use the RDD sum:
df.select("Number").rdd.keys().sum()
SQL sum using selectExpr:
df.selectExpr("sum(Number)").first()[0]
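Either expression yields a plain Python int, so it can be stored directly in a variable, as the question asks (a minimal sketch using the example df from the question):
result = df.select("Number").rdd.keys().sum()      # 130
result = df.selectExpr("sum(Number)").first()[0]   # 130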
Upvotes: 0
Reputation: 1207
Similar to other answers, but without the use of a groupBy or agg. I just select the column in question, sum it, collect it, and then grab the first row and first column ([0][0]) to return an int. The only reason I chose this over the accepted answer is that I am new to PySpark and was confused that the 'Number' column was not explicitly summed in the accepted answer. If I had to come back after some time and try to understand what was happening, syntax such as the below would be easier for me to follow.
import pyspark.sql.functions as f
df.select(f.sum('Number')).collect()[0][0]
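collect() returns a list of Rows; the first [0] picks the only Row and the second [0] picks the only column. A small sketch of the intermediate values, assuming the example df from the question:
rows = df.select(f.sum('Number')).collect()  # [Row(sum(Number)=130)]
result = rows[0][0]                          # 130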
Upvotes: 2
Reputation: 698
You can also try using the first() function. It returns the first row of the DataFrame, and you can access the values of the respective columns using indices.
df.groupBy().sum().first()[0]
In your case, the result is a DataFrame with a single row and column, so the above snippet works.
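A Row can also be indexed by column name, which reads a little more explicitly (the generated column name sum(Number) is taken from the output shown in the question):
row = df.groupBy().sum().first()
result = row["sum(Number)"]  # 130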
Upvotes: 0
Reputation: 2411
If you want a specific column:
import pyspark.sql.functions as F
df.agg(F.sum("my_column")).collect()[0][0]
Upvotes: 49
Reputation: 99
Sometimes when you read a CSV file into a PySpark DataFrame, a numeric column may come in as string type (e.g. '23'). In that case you should use pyspark.sql.functions.sum to get the result as an int, not Python's built-in sum():
import pyspark.sql.functions as F
df.groupBy().agg(F.sum('Number')).show()
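If the column really did come in as strings, it may be safer to cast it first. A small sketch; the cast to 'int' is an assumption about the data, not part of the original answer:
# Hypothetical: Number was read from CSV as a string column
df = df.withColumn('Number', F.col('Number').cast('int'))
result = df.agg(F.sum('Number')).collect()[0][0]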
Upvotes: -3
Reputation: 1983
This is another way you can do this, using agg and collect:
sum_number = df.agg({"Number":"sum"}).collect()[0]
result = sum_number["sum(Number)"]
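If you'd rather not depend on the generated column name, an alias keeps the key predictable (a small variation on the same idea, not from the original answer):
import pyspark.sql.functions as F
sum_number = df.agg(F.sum("Number").alias("total")).collect()[0]
result = sum_number["total"]  # 130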
Upvotes: 17
Reputation: 844
The simplest way, really:
df.groupBy().sum().collect()
But it is a very slow operation. Avoid groupByKey; you should use the RDD and reduceByKey instead:
# Key every row with the constant 1, then add up the Number values (x[1]) by key
df.rdd.map(lambda x: (1, x[1])).reduceByKey(lambda x, y: x + y).collect()[0][1]
I tried this on a bigger dataset and measured the processing time:
RDD and reduceByKey: 2.23 s
groupByKey: 30.5 s
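For a single global sum you can also skip the key entirely and let the RDD do the reduction (a sketch under the same assumption that Number is the second field of each row; not benchmarked here):
result = df.rdd.map(lambda x: x[1]).sum()  # 130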
Upvotes: 39
Reputation: 784
I think the simplest way:
df.groupBy().sum().collect()
will return a list. In your example:
In [9]: df.groupBy().sum().collect()[0][0]
Out[9]: 130
Upvotes: 54
Reputation: 2696
The following should work:
# collect() returns a list, so take its first element to get the int
df.groupBy().sum().rdd.map(lambda x: x[0]).collect()[0]
Upvotes: -2