Tronald Dump
Tronald Dump

Reputation: 1350

Pyspark DataFrame Conditional groupBy

from pyspark.sql import Row, functions as F
row = Row("UK_1","UK_2","Date","Cat")
agg = ''
agg = 'Cat'
tdf = (sc.parallelize
    ([
        row(1,1,'12/10/2016',"A"),
        row(1,2,None,'A'),
        row(2,1,'14/10/2016','B'),
        row(3,3,'!~2016/2/276','B'),
        row(None,1,'26/09/2016','A'),
        row(1,1,'12/10/2016',"A"),
        row(1,2,None,'A'),
        row(2,1,'14/10/2016','B'),
        row(None,None,'!~2016/2/276','B'),
        row(None,1,'26/09/2016','A')
        ]).toDF())
tdf.groupBy(  iff(len(agg.strip()) > 0 , F.col(agg),  )).agg(F.count('*').alias('row_count')).show()

Is there a way to use a column or no column based on some condition in the dataframe groupBy?

Upvotes: 0

Views: 2565

Answers (1)

akuiper
akuiper

Reputation: 214957

You can provide an empty list to groupBy if the condition you are looking for is not met, which will groupBy no column:

tdf.groupBy(agg if len(agg) > 0 else []).agg(...)

agg = ''
tdf.groupBy(agg if len(agg) > 0 else []).agg(F.count('*').alias('row_count')).show()
+---------+
|row_count|
+---------+
|       10|
+---------+

agg = 'Cat'
tdf.groupBy(agg if len(agg) > 0 else []).agg(F.count('*').alias('row_count')).show()
+---+---------+
|Cat|row_count|
+---+---------+
|  B|        4|
|  A|        6|
+---+---------+

Upvotes: 1

Related Questions