Reputation: 337
I am working in PySpark and I have a DataFrame with the following columns.
Q1 = spark.read.csv("Q1final.csv",header = True, inferSchema = True)
Q1.printSchema()
root
|-- index_date: integer (nullable = true)
|-- item_id: integer (nullable = true)
|-- item_COICOP_CLASSIFICATION: integer (nullable = true)
|-- item_desc: string (nullable = true)
|-- index_algorithm: integer (nullable = true)
|-- stratum_ind: integer (nullable = true)
|-- item_index: double (nullable = true)
|-- all_gm_index: double (nullable = true)
|-- gm_ra_index: double (nullable = true)
|-- coicop_weight: double (nullable = true)
|-- item_weight: double (nullable = true)
|-- cpih_coicop_weight: double (nullable = true)
I need the sum of all the elements in the last column (cpih_coicop_weight) to use as a Double in other parts of my program. How can I do it? Thank you very much in advance!
Upvotes: 16
Views: 50051
Reputation: 3821
If you want just a double or int as the return value, the following function will work:
from pyspark.sql import functions as F

def sum_col(df, col):
    return df.select(F.sum(col)).collect()[0][0]
Then
sum_col(Q1, 'cpih_coicop_weight')
will return the sum. I am new to PySpark, so I am not sure why such a simple method is not already available on the column object.
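For completeness, here is a minimal self-contained sketch of this approach; the SparkSession setup and the toy weight values are illustrative assumptions, not taken from the question:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sum_col_demo").getOrCreate()

# Toy stand-in for Q1 with hypothetical weights
df = spark.createDataFrame(
    [(1, 0.25), (2, 0.50), (3, 0.75)],
    ["item_id", "cpih_coicop_weight"],
)

def sum_col(df, col):
    # F.sum aggregates the whole column into a single Row; [0][0] unwraps it
    return df.select(F.sum(col)).collect()[0][0]

print(sum_col(df, "cpih_coicop_weight"))  # 1.5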
Upvotes: 27
Reputation: 1063
This can also be tried:
total = Q1.agg(F.sum("cpih_coicop_weight")).collect()
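Note that collect() returns a list of Row objects, so total here would be [Row(sum(cpih_coicop_weight)=...)] rather than a bare number. To get a plain double, index into the result:

# collect() yields a list with a single Row; [0][0] extracts the double
total = Q1.agg(F.sum("cpih_coicop_weight")).collect()[0][0]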
Upvotes: 5
Reputation: 15258
Try this:
from pyspark.sql import functions as F
total = Q1.groupBy().agg(F.sum("cpih_coicop_weight")).collect()
In total, you should have your result.
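As in the previous answer, collect() leaves you with a list containing a single Row. A slightly shorter variant for getting the bare double (my own suggestion using DataFrame.first(), not part of the original answer) would be:

# first() returns the single aggregated Row; [0] pulls out the double
total = Q1.groupBy().agg(F.sum("cpih_coicop_weight")).first()[0]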
Upvotes: 10