Loki

Reputation: 41

GroupBy with PySpark through filters

I have a df derived from clustering that looks like this:

Cluster   Variable 1   Variable 2
0         334          32
0         0            45
3         453          0
3         320          0
0         0            28
1         467          49
3         324          16
1         58           2

I'm trying to achieve the following result for each cluster and each variable:

Variable 1

Cluster   % of 0   % of vals != 0   Count of vals != 0   Sum of values   % universe
0         67       33               1                    334             17
1         0        100              2                    525             27
3         0        100              3                    1097            56

Variable 2

Cluster   % of 0   % of vals != 0   Count of vals != 0   Sum of values   % universe
0         0        100              3                    105             61
1         0        100              2                    51              29
3         67       33               1                    16              10

Note: % universe is measured against the total sum of the variable's values across all clusters. For Variable 1 that total is 334 + 525 + 1097 = 1956 (this is 100%, so 334 is 17% of it).
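
In code terms, that universe total is just the sum of one variable over the whole DataFrame. A minimal sketch (assuming the column is literally named 'Variable 1', as in the table above):

from pyspark.sql.functions import col, sum as spark_sum

# Universe total for Variable 1: 334 + 525 + 1097 = 1956 on the data above
universe = df.agg(spark_sum(col('Variable 1'))).first()[0]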

I'm in the process of learning PySpark and I'm struggling with the syntax. This is the code I'm trying, but I'm at a loss because I don't know how to manage the filtering to iterate over each variable and each cluster:

for i in list_of_variables:
    print(i)
    df.groupBy('Cluster').agg(
        (count((col(i) == 0) / df.filter(col('Cluster') == 0).count()) * 100).alias('% of 0'),
        (count((col(i) != 0) / df.filter(col('Cluster') == 0).count() * 100).alias('% of vals diff than 0')..

I would be very grateful for any ideas that could shed some light on how to achieve this. Have an awesome day!

Upvotes: 0

Views: 49

Answers (1)

Victor Arima

Reputation: 26

Maybe you could try something like this to get the counts part:

from pyspark.sql.functions import col, count

for i in list_of_variables:
    print(i)
    # Keep only the non-zero rows for this variable, then count them per cluster
    df.filter(col(i) != 0).groupBy('Cluster').agg(
        count('*').alias('Count_vals_dif_0')
    ).show()
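
To extend this to the percentages, sums, and % universe in one pass, a conditional aggregation with when might work. This is only a sketch under the assumptions above (list_of_variables holds the variable column names; the universe is the variable's total over the whole DataFrame):

from pyspark.sql.functions import col, count, sum as spark_sum, when

for i in list_of_variables:
    print(i)
    # Universe total: sum of this variable across all clusters
    universe = df.agg(spark_sum(col(i))).first()[0]

    df.groupBy('Cluster').agg(
        # when() yields null where the condition is false, so count() skips those rows
        (count(when(col(i) == 0, True)) / count('*') * 100).alias('% of 0'),
        (count(when(col(i) != 0, True)) / count('*') * 100).alias('% of vals != 0'),
        count(when(col(i) != 0, True)).alias('Count of vals != 0'),
        spark_sum(col(i)).alias('Sum of values'),
        (spark_sum(col(i)) / universe * 100).alias('% universe'),
    ).orderBy('Cluster').show()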

Upvotes: 1
