I have applied a groupby and am calculating the standard deviation for two features in a PySpark DataFrame:
from pyspark.sql import functions as f
val1 = [('a',20,100),('a',100,100),('a',50,100),('b',0,100),('b',0,100),('c',0,0),('c',0,50),('c',0,100),('c',0,20)]
cols = ['group','val1','val2']
tf = spark.createDataFrame(val1, cols)
tf.show()
tf.groupby('group').agg(f.stddev(['val1','val2']).alias('val1_std','val2_std'))
but it gives me the following error:
TypeError: _() takes 1 positional argument but 2 were given
How can I do this in PySpark?
The problem is that the stddev function acts on a single column, not on multiple columns as in the code you have written (hence the error message about 1 positional argument vs. 2). One way to get what you are looking for is to calculate the standard deviation separately for each column:
# Build one stddev expression per column
expressions = [f.stddev(col).alias('%s_std' % col) for col in ['val1', 'val2']]
# Unpack the expressions into a single aggregation
tf.groupby('group').agg(*expressions).show()
#+-----+------------------+------------------+
#|group| val1_std| val2_std|
#+-----+------------------+------------------+
#| c| 0.0|43.493294502332965|
#| b| 0.0| 0.0|
#| a|40.414518843273804| 0.0|
#+-----+------------------+------------------+
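Equivalently, since there are only two columns here, you can write the aggregation expressions out explicitly in a single agg call; this is the same technique, just without the list comprehension:
from pyspark.sql import functions as f

# One stddev expression per column, each with its own alias
tf.groupby('group').agg(
    f.stddev('val1').alias('val1_std'),
    f.stddev('val2').alias('val2_std'),
).show()
Note that f.stddev computes the sample standard deviation (it is an alias for stddev_samp); if you want the population standard deviation instead, use f.stddev_pop.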