Dusty

Reputation: 181

PySpark: Fastest way of counting values in multiple columns

I need to count occurrences of a value in several columns, and I want the individual count for each column collected in a list.

Is there a faster/better way of doing this? My current solution takes quite some time.

from pyspark.sql.functions import col

dataframe.cache()
counts = [dataframe.filter(col(str(i)) == "value").count() for i in range(150)]

Upvotes: 0

Views: 2020

Answers (2)

sam

Reputation: 1896

You can try the following approach/design

  1. Write a map function that turns each row of the data frame into a list of flags, like this:

VALUE = 'value'

def row_mapper(df_row):
    # True wherever the column's value equals VALUE (True sums as 1)
    return [each == VALUE for each in df_row]

  2. Write a reduce function that takes two such rows as input and adds them element-wise:

def reduce_rows(df_row1, df_row2):
    return [x + y for x, y in zip(df_row1, df_row2)]

Note: these are plain Python functions meant to illustrate the idea, not UDFs you can apply directly to a PySpark DataFrame; a sketch of wiring them together over the DataFrame's RDD follows below.
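
A minimal sketch of how these helpers could be applied, assuming a DataFrame whose columns are the ones being counted (the tiny frame below is only for illustration; in the real case df would have columns "0" through "149"):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("value", "x"), ("value", "value")], ["0", "1"])

# row_mapper and reduce_rows are the helpers defined above:
# row_mapper turns each row into a list of True/False flags (True sums as 1),
# reduce_rows adds those lists element-wise across all rows.
counts = df.rdd.map(row_mapper).reduce(reduce_rows)
# counts -> [2, 1], the per-column occurrence counts of VALUE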

Upvotes: 0

mck

Reputation: 42422

You can do a conditional count aggregation:

import pyspark.sql.functions as F

df2 = df.agg(*[
    # when() yields null for non-matching rows, and count() ignores nulls,
    # so this counts the rows where column str(i) equals "value"
    F.count(F.when(F.col(str(i)) == "value", 1)).alias(str(i))
    for i in range(150)
])

result = df2.toPandas().transpose()[0].tolist()
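
If you prefer to skip the Pandas round trip, the same single aggregated row can also be collected directly (a small sketch using the df2 from above):

result = list(df2.first())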

Upvotes: 1
