E. Camus

Reputation: 149

Summing values across each row as boolean (PySpark)

I currently have a PySpark dataframe that has many columns populated by integer counts. Many of these columns have counts of zero. I would like to find a way to sum how many columns have counts greater than zero.

In other words, I would like an approach that sums values across a row, where all the columns for a given row are effectively boolean (although the datatype conversion may not be necessary). Several columns in my table are datetime or string, so ideally I would have an approach that first selects the numeric columns.

Current Dataframe example and Desired Output

+----+----------+----------+----------+---
|USER|   DATE   |COUNT_COL1|COUNT_COL2|...       DESIRED COLUMN
+----+----------+----------+----------+---
|  b | 7/1/2019 |       12 |        1 |                2        (2 columns are non-zero)
|  a | 6/9/2019 |        0 |        5 |                1
|  c | 1/1/2019 |        0 |        0 |                0

Pandas: As an example, in pandas this can be accomplished by selecting the numeric columns, converting them to bool, and summing with axis=1. I am looking for a PySpark equivalent.

import numpy as np

test_cols = list(pandas_df.select_dtypes(include=[np.number]).columns.values)
pandas_df[test_cols].astype(bool).sum(axis=1)
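
For reference, a self-contained version of the pandas snippet; the DataFrame name pandas_df and the values are just taken from the example table above.

import numpy as np
import pandas as pd

# Rebuild the example table from the question.
pandas_df = pd.DataFrame({
    "USER": ["b", "a", "c"],
    "DATE": ["7/1/2019", "6/9/2019", "1/1/2019"],
    "COUNT_COL1": [12, 0, 0],
    "COUNT_COL2": [1, 5, 0],
})

test_cols = list(pandas_df.select_dtypes(include=[np.number]).columns.values)
pandas_df[test_cols].astype(bool).sum(axis=1)  # -> 2, 1, 0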

Upvotes: 1

Views: 1952

Answers (2)

chlebek

Reputation: 2451

Let's say you have the df below:

df.show()
df.printSchema()

+---+---+---+---+
|_c0|_c1|_c2|_c3|
+---+---+---+---+
|  a|  1|  2|  3|
|  a|  0|  2|  1|
|  a|  0|  0|  1|
|  a|  0|  0|  0|
+---+---+---+---+

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)

Using a case when statement you can check whether a column value is numeric and, if so, whether it is larger than 0. The length(regexp_replace(...)) check treats any value that still contains a non-digit character after stripping digits as non-numeric (so only non-negative integer strings pass), and those entries evaluate to False. In the next step f.size returns the count, thanks to f.array_remove, which keeps only the entries with a True value.

from pyspark.sql import functions as f

cols = [f.when(f.length(f.regexp_replace(f.col(x), '\\d+', '')) > 0, False)
         .otherwise(f.col(x).cast('int') > 0)
        for x in df.columns]
df.select("*", f.size(f.array_remove(f.array(*cols), False)).alias("count")).show()

+---+---+---+---+-----+
|_c0|_c1|_c2|_c3|count|
+---+---+---+---+-----+
|  a|  1|  2|  3|    3|
|  a|  0|  2|  1|    2|
|  a|  0|  0|  1|    1|
|  a|  0|  0|  0|    0|
+---+---+---+---+-----+

Upvotes: 0

murtihash

Reputation: 8410

For numeric columns, you can do this by creating an array of all the integer-valued columns (using df.dtypes), and then using higher-order functions. In this case I used filter to get rid of all the 0s, and then size to get the number of non-zero elements per row (Spark 2.4+).

from pyspark.sql import functions as F
df.withColumn("arr", F.array(*[F.col(i[0]) for i in df.dtypes if i[1] in ['int','bigint']]))\
  .withColumn("DESIRED COLUMN", F.expr("""size(filter(arr,x->x!=0))""")).drop("arr").show()

#+----+--------+----------+----------+--------------+
#|USER|    DATE|COUNT_COL1|COUNT_COL2|DESIRED COLUMN|
#+----+--------+----------+----------+--------------+
#|   b|7/1/2019|        12|         1|             2|
#|   a|6/9/2019|         0|         5|             1|
#|   c|1/1/2019|         0|         0|             0|
#+----+--------+----------+----------+--------------+
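
If you are on a Spark release older than 2.4 (so no higher-order functions), one rough alternative is to sum a 0/1 indicator per numeric column instead of building an array. A minimal sketch along the same lines, assuming the same df:

from functools import reduce
from operator import add
from pyspark.sql import functions as F

# Pick the numeric columns, then add up one 0/1 indicator expression per column.
num_cols = [c for c, t in df.dtypes if t in ('int', 'bigint')]
indicator = reduce(add, [F.when(F.col(c) != 0, 1).otherwise(0) for c in num_cols])

df.withColumn("DESIRED COLUMN", indicator).show()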

Upvotes: 2
