Reputation: 41
I have a df derived from clustering that looks like this:
Cluster | Variable 1 | Variable 2 |
---|---|---|
0 | 334 | 32 |
0 | 0 | 45 |
3 | 453 | 0 |
3 | 320 | 0 |
0 | 0 | 28 |
1 | 467 | 49 |
3 | 324 | 16 |
1 | 58 | 2 |
And I'm trying to achieve the following result for each cluster and each variable:
Variable 1
Cluster | % of 0 | % of vals != 0 | Count of vals != 0 | Sum of values | % universe |
---|---|---|---|---|---|
0 | 67 | 33 | 1 | 334 | 17 |
1 | 0 | 100 | 2 | 525 | 27 |
3 | 0 | 100 | 3 | 1097 | 56 |
Variable 2
Cluster | % of 0 | % of vals != 0 | Count of vals != 0 | Sum of values | % universe |
---|---|---|---|---|---|
0 | 0 | 100 | 3 | 105 | 61 |
1 | 0 | 100 | 2 | 51 | 29 |
3 | 67 | 33 | 1 | 16 | 10 |
Note: % universe is each cluster's sum as a percentage of the variable's grand total across all clusters. For Variable 1 the total is 334 + 525 + 1097 = 1956 (that's 100%, so 334 is 17% of the total).
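For instance, the grand total for one variable can be obtained with something like this (a sketch, assuming the DataFrame above is named df):

from pyspark.sql import functions as F

# Total of Variable 1 over all rows and clusters: the 100% reference for % universe.
universe = df.agg(F.sum('Variable 1')).first()[0]   # 1956 for the example data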
I'm in the process of learning PySpark and I'm struggling with the syntax. This is the code I'm trying, but I'm at a loss because I don't know how to manage the filtering so that it iterates per variable and per cluster:
for i in list_of_variables:
    print(i)
    df.groupBy('Cluster').agg(
        (count(col(i) == 0) / df.filter(col('Cluster') == 0).count() * 100).alias('% of 0'),
        (count(col(i) != 0) / df.filter(col('Cluster') == 0).count() * 100).alias('% of vals diff than 0'),
        ...
I would be very grateful for any ideas that could shed some light on how to accomplish this. Have an awesome day!
Upvotes: 0
Views: 49
Reputation: 26
Maybe you could try something like this to get the counts part:
from pyspark.sql.functions import col, count

for i in list_of_variables:
    print(i)
    # keep only the non-zero rows for this variable, then count them per cluster
    df.filter(col(i) != 0).groupBy('Cluster').agg(count('*').alias('Count_vals_dif_0')).show()
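To extend that to the full table, here is a minimal sketch using conditional aggregation. It assumes the df and list_of_variables from the question; the output column names are placeholders of my own, and this is one possible approach rather than the only one:

from pyspark.sql import functions as F

for i in list_of_variables:
    print(i)
    # Grand total of this variable over all clusters: the 100% reference for
    # the % universe column (1956 for Variable 1 in the example data).
    total = df.agg(F.sum(i)).first()[0]

    # Conditional count: number of non-zero values per cluster.
    nonzero = F.sum(F.when(F.col(i) != 0, 1).otherwise(0))

    df.groupBy('Cluster').agg(
        ((F.count('*') - nonzero) / F.count('*') * 100).alias('pct_of_0'),
        (nonzero / F.count('*') * 100).alias('pct_not_0'),
        nonzero.alias('count_not_0'),
        F.sum(i).alias('sum_of_values'),
        (F.sum(i) / total * 100).alias('pct_universe'),
    ).orderBy('Cluster').show()

The percentages come out as floats (e.g. 66.67 rather than 67), so wrap them in F.round(..., 0) if you want the integer values shown in the expected output; and if a variable could sum to zero overall, guard the division by total.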
Upvotes: 1