Apache Spark: grouping different rows together based on conditionals

Question

I am attempting to simplify a dataframe in Apache Spark (Python).

I have a dataframe like this

person   X  N      A      B      C      D
NCC1701  1  16309  false  true   false  false
NCC1864  1  16309  false  false  true   false
...

I want to group by each row's X and N, like groupBy('X','N'), but I also want a count of how often each of the columns A-D is true, treating false as 0 and true as 1, so that I get a result like this:

X  N      A  B  C  D
1  16309  0  1  1  0

In short, I am trying to group by columns X and N and, for each pair of X and N, sum the "true" values in each of the columns A-D. If 'true' and 'false' were actual numbers, I might know how to do this, but I don't know how to treat 'true' as 1 and 'false' as 0 and then take the sums.

How can I group the different cells together for each X and N?

thanks for your time

akuiper · Accepted Answer

Use the cast method to convert the data type from boolean to integer, and then do the sum:

import pyspark.sql.functions as f

# Boolean columns to count; casting to int maps true -> 1 and false -> 0.
cols = ['A', 'B', 'C', 'D']
df.groupBy('X', 'N').agg(*(f.sum(f.col(x).cast('int')).alias(x) for x in cols)).show()
+---+-----+---+---+---+---+
|  X|    N|  A|  B|  C|  D|
+---+-----+---+---+---+---+
|  1|16309|  0|  1|  1|  0|
+---+-----+---+---+---+---+
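
To see what the cast-and-sum is computing, here is a rough plain-Python sketch of the same aggregation, run on the two sample rows from the question. It is only an illustration of the logic, not Spark code: the `rows` list and the `count_true` helper are made up for this example.

```python
from collections import defaultdict

# The two sample rows from the question, as plain dicts (no Spark involved).
rows = [
    {"person": "NCC1701", "X": 1, "N": 16309, "A": False, "B": True, "C": False, "D": False},
    {"person": "NCC1864", "X": 1, "N": 16309, "A": False, "B": False, "C": True, "D": False},
]
cols = ["A", "B", "C", "D"]


def count_true(rows, keys, cols):
    """Group rows by `keys` and count the True values in each of `cols`."""
    # Each new group starts with a fresh dict of zero counts.
    totals = defaultdict(lambda: dict.fromkeys(cols, 0))
    for row in rows:
        group = tuple(row[k] for k in keys)
        for c in cols:
            # int(True) == 1 and int(False) == 0, the same idea as cast('int').
            totals[group][c] += int(row[c])
    return dict(totals)


result = count_true(rows, ["X", "N"], cols)
print(result)  # {(1, 16309): {'A': 0, 'B': 1, 'C': 1, 'D': 0}}
```

The single (X, N) group ends up with the same counts as the Spark output above.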
