from pyspark.sql import Row, functions as F

row = Row("UK_1", "UK_2", "Date", "Cat")
agg = ''
agg = 'Cat'
tdf = sc.parallelize([
    row(1, 1, '12/10/2016', 'A'),
    row(1, 2, None, 'A'),
    row(2, 1, '14/10/2016', 'B'),
    row(3, 3, '!~2016/2/276', 'B'),
    row(None, 1, '26/09/2016', 'A'),
    row(1, 1, '12/10/2016', 'A'),
    row(1, 2, None, 'A'),
    row(2, 1, '14/10/2016', 'B'),
    row(None, None, '!~2016/2/276', 'B'),
    row(None, 1, '26/09/2016', 'A')
]).toDF()
# pseudocode -- what I want: group by F.col(agg) only when agg is non-empty
tdf.groupBy(iff(len(agg.strip()) > 0, F.col(agg), )).agg(F.count('*').alias('row_count')).show()
Is there a way to make the DataFrame groupBy use either a column or no column at all, depending on some condition?
You can pass an empty list to groupBy when your condition is not met, which groups by no columns, so the aggregation runs over the whole DataFrame:
tdf.groupBy(agg if len(agg) > 0 else []).agg(...)
agg = ''
tdf.groupBy(agg if len(agg) > 0 else []).agg(F.count('*').alias('row_count')).show()
+---------+
|row_count|
+---------+
| 10|
+---------+
agg = 'Cat'
tdf.groupBy(agg if len(agg) > 0 else []).agg(F.count('*').alias('row_count')).show()
+---+---------+
|Cat|row_count|
+---+---------+
| B| 4|
| A| 6|
+---+---------+
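For reuse, the same conditional can be wrapped in a small helper. A minimal sketch, assuming the tdf DataFrame and the functions-as-F import from the question; the name count_by is illustrative only, not part of the Spark API:

def count_by(df, agg):
    # group by the named column when agg is a non-empty string,
    # otherwise group by no columns (one global aggregate)
    cols = [agg] if agg and agg.strip() else []
    return df.groupBy(cols).agg(F.count('*').alias('row_count'))

count_by(tdf, '').show()     # single overall row count
count_by(tdf, 'Cat').show()  # per-category row counts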