Loki

Reputation: 41

GroupBy with PySpark through filters

I have a df derived from clustering that looks like this:

Cluster   Variable 1   Variable 2
0         334          32
0         0            45
3         453          0
3         320          0
0         0            28
1         467          49
3         324          16
1         58           2

I'm trying to achieve the following result for each cluster and each variable:

Variable 1

Cluster   % of 0   % of vals != 0   Count of vals != 0   Sum of values   % universe
0         67       33               1                    334             17
1         0        100              2                    525             27
3         0        100              3                    1097            56

Variable 2

Cluster   % of 0   % of vals != 0   Count of vals != 0   Sum of values   % universe
0         0        100              3                    105             61
1         0        100              2                    51              29
3         67       33               1                    16              10

Note: % universe is measured against the total sum of the variable's values across all clusters. For Variable 1 that total is 334 + 525 + 1097 = 1956 (this is 100%, so 334 is 17% of it).
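
In code terms, that universe total is just the sum of one variable over the whole DataFrame. A minimal sketch (assuming the column is literally named 'Variable 1', as in the table above):

from pyspark.sql.functions import col, sum as spark_sum

# Universe total for Variable 1: 334 + 525 + 1097 = 1956 on the data above
universe = df.agg(spark_sum(col('Variable 1'))).first()[0]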

I'm in the process of learning PySpark and I'm struggling with the syntax. This is the code I'm trying, but I'm at a loss because I don't know how to manage the filtering to iterate over each variable and each cluster:

for i in list_of_variables:
    print(i)
    df.groupBy('Cluster').agg(
        (count((col(i) == 0) / df.filter(col('Cluster') == 0).count()) * 100).alias('% of 0'),
        (count((col(i) != 0) / df.filter(col('Cluster') == 0).count() * 100).alias('% of vals diff than 0')..

I would be very grateful for any ideas that could shed some light on how to achieve this. Have an awesome day!

Upvotes: 0

Views: 49

Answers (1)

Victor Arima

Reputation: 26

Maybe you could try something like this to get the counts part:

from pyspark.sql.functions import col, count

for i in list_of_variables:
    print(i)
    # Keep only the non-zero rows for this variable, then count them per cluster
    df.filter(col(i) != 0).groupBy('Cluster').agg(
        count('*').alias('Count_vals_dif_0')
    ).show()
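
To extend this to the percentages, sums, and % universe in one pass, a conditional aggregation with when might work. This is only a sketch under the assumptions above (list_of_variables holds the variable column names; the universe is the variable's total over the whole DataFrame):

from pyspark.sql.functions import col, count, sum as spark_sum, when

for i in list_of_variables:
    print(i)
    # Universe total: sum of this variable across all clusters
    universe = df.agg(spark_sum(col(i))).first()[0]

    df.groupBy('Cluster').agg(
        # when() yields null where the condition is false, so count() skips those rows
        (count(when(col(i) == 0, True)) / count('*') * 100).alias('% of 0'),
        (count(when(col(i) != 0, True)) / count('*') * 100).alias('% of vals != 0'),
        count(when(col(i) != 0, True)).alias('Count of vals != 0'),
        spark_sum(col(i)).alias('Sum of values'),
        (spark_sum(col(i)) / universe * 100).alias('% universe'),
    ).orderBy('Cluster').show()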

Upvotes: 1
