Reputation: 6123
I am attempting to simplify a dataframe in Apache Spark (Python).
I have a dataframe like this
person   X  N      A      B      C      D
NCC1701  1  16309  false  true   false  false
NCC1864  1  16309  false  false  true   false
...
I want to group the rows by X and N, like groupBy('X','N'), and count how often each of the columns A-D is true, treating false as 0 and true as 1, so that I get a result like this
X  N      A  B  C  D
1  16309  0  1  1  0
In short, I am attempting to group by columns X and N and, for each pair of X and N, sum the "true" values in each of the columns A-D. If 'true' and 'false' were actual numbers, I might know how to do this, but I don't know how to treat 'true' as 1 and 'false' as 0 and then sum them.
How can I group the different cells together for each X and N?
Thanks for your time.
Upvotes: 0
Views: 55
Reputation: 215117
Use the cast method to convert the data type from boolean to integer, and then do the sum:
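import pyspark.sql.functions as f
# Boolean columns to total
cols = ['A', 'B', 'C', 'D']
# Cast each boolean to int (true -> 1, false -> 0), then sum within each (X, N) group
df.groupBy('X', 'N').agg(*(f.sum(f.col(x).cast('int')).alias(x) for x in cols)).show()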
+---+-----+---+---+---+---+
|  X|    N|  A|  B|  C|  D|
+---+-----+---+---+---+---+
|  1|16309|  0|  1|  1|  0|
+---+-----+---+---+---+---+
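To see what the cast-then-sum aggregation computes without a Spark session, here is a rough plain-Python sketch of the same logic. It is only an illustration, not PySpark code: the rows list copies the two sample rows from the question, and the helper name sum_flags_by_group is made up for this example.

```python
from collections import defaultdict

# Sample rows copied from the question: (person, X, N, A, B, C, D)
rows = [
    ("NCC1701", 1, 16309, False, True, False, False),
    ("NCC1864", 1, 16309, False, False, True, False),
]
cols = ["A", "B", "C", "D"]


def sum_flags_by_group(rows):
    """Group rows by (X, N) and sum each boolean column after casting to int.

    Hypothetical helper that mimics groupBy('X', 'N') followed by
    sum(col.cast('int')) for each of the columns A-D.
    """
    totals = defaultdict(lambda: [0] * len(cols))
    for _person, x, n, *flags in rows:
        for i, flag in enumerate(flags):
            # int(True) == 1 and int(False) == 0, like cast('int') on a boolean
            totals[(x, n)][i] += int(flag)
    return {key: dict(zip(cols, sums)) for key, sums in totals.items()}


print(sum_flags_by_group(rows))
# prints {(1, 16309): {'A': 0, 'B': 1, 'C': 1, 'D': 0}}
```

The person column drops out because it is neither a grouping key nor an aggregated column, which matches the Spark output above.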
Upvotes: 2