E. Camus

Reputation: 149

Summing values across each row as boolean (PySpark)

I currently have a PySpark dataframe that has many columns populated by integer counts. Many of these columns have counts of zero. I would like to find a way to sum how many columns have counts greater than zero.

In other words, I would like an approach that sums values across a row, where all the columns for a given row are effectively boolean (although the datatype conversion may not be necessary). Several columns in my table are datetime or string, so ideally I would have an approach that first selects the numeric columns.

Current Dataframe example and Desired Output

+----+----------+----------+----------+---
|USER|   DATE   |COUNT_COL1|COUNT_COL2|...       DESIRED COLUMN
+----+----------+----------+----------+---
|  b | 7/1/2019 |       12 |        1 |                2        (2 columns are non-zero)
|  a | 6/9/2019 |        0 |        5 |                1
|  c | 1/1/2019 |        0 |        0 |                0

Pandas: As an example, in pandas this can be accomplished by selecting the numeric columns, converting them to bool, and summing with axis=1. I am looking for a PySpark equivalent.

import numpy as np

test_cols = list(pandas_df.select_dtypes(include=[np.number]).columns.values)
pandas_df[test_cols].astype(bool).sum(axis=1)
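
For reference, a self-contained version of the pandas snippet; the DataFrame name pandas_df and the values are just taken from the example table above.

import numpy as np
import pandas as pd

# Rebuild the example table from the question.
pandas_df = pd.DataFrame({
    "USER": ["b", "a", "c"],
    "DATE": ["7/1/2019", "6/9/2019", "1/1/2019"],
    "COUNT_COL1": [12, 0, 0],
    "COUNT_COL2": [1, 5, 0],
})

test_cols = list(pandas_df.select_dtypes(include=[np.number]).columns.values)
pandas_df[test_cols].astype(bool).sum(axis=1)  # -> 2, 1, 0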

Upvotes: 1

Views: 1952

Answers (2)

chlebek

Reputation: 2451

Let's say you have the df below:

df.show()
df.printSchema()

+---+---+---+---+
|_c0|_c1|_c2|_c3|
+---+---+---+---+
|  a|  1|  2|  3|
|  a|  0|  2|  1|
|  a|  0|  0|  1|
|  a|  0|  0|  0|
+---+---+---+---+

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)

Using a case when statement you can check whether a column value is numeric and, if so, whether it is larger than 0. The length(regexp_replace(...)) check treats any value that still contains a non-digit character after stripping digits as non-numeric (so only non-negative integer strings pass), and those entries evaluate to False. In the next step f.size returns the count, thanks to f.array_remove, which keeps only the entries with a True value.

from pyspark.sql import functions as f

cols = [f.when(f.length(f.regexp_replace(f.col(x), '\\d+', '')) > 0, False)
         .otherwise(f.col(x).cast('int') > 0)
        for x in df.columns]
df.select("*", f.size(f.array_remove(f.array(*cols), False)).alias("count")).show()

+---+---+---+---+-----+
|_c0|_c1|_c2|_c3|count|
+---+---+---+---+-----+
|  a|  1|  2|  3|    3|
|  a|  0|  2|  1|    2|
|  a|  0|  0|  1|    1|
|  a|  0|  0|  0|    0|
+---+---+---+---+-----+

Upvotes: 0

murtihash

Reputation: 8410

For numeric columns, you can do this by creating an array of all the integer-valued columns (using df.dtypes), and then using higher-order functions. In this case I used filter to get rid of all the 0s, and then size to get the number of non-zero elements per row (Spark 2.4+).

from pyspark.sql import functions as F
df.withColumn("arr", F.array(*[F.col(i[0]) for i in df.dtypes if i[1] in ['int','bigint']]))\
  .withColumn("DESIRED COLUMN", F.expr("""size(filter(arr,x->x!=0))""")).drop("arr").show()

#+----+--------+----------+----------+--------------+
#|USER|    DATE|COUNT_COL1|COUNT_COL2|DESIRED COLUMN|
#+----+--------+----------+----------+--------------+
#|   b|7/1/2019|        12|         1|             2|
#|   a|6/9/2019|         0|         5|             1|
#|   c|1/1/2019|         0|         0|             0|
#+----+--------+----------+----------+--------------+
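
If you are on a Spark release older than 2.4 (so no higher-order functions), one rough alternative is to sum a 0/1 indicator per numeric column instead of building an array. A minimal sketch along the same lines, assuming the same df:

from functools import reduce
from operator import add
from pyspark.sql import functions as F

# Pick the numeric columns, then add up one 0/1 indicator expression per column.
num_cols = [c for c, t in df.dtypes if t in ('int', 'bigint')]
indicator = reduce(add, [F.when(F.col(c) != 0, 1).otherwise(0) for c in num_cols])

df.withColumn("DESIRED COLUMN", indicator).show()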

Upvotes: 2
