Reputation: 181
I need to count occurrences of a value in several columns and collect the individual count for each column into a list.
Is there a faster/better way of doing this? My current solution takes quite some time:
from pyspark.sql.functions import col

dataframe.cache()
# 150 separate Spark jobs: one filter + count pass per column
counts = [dataframe.filter(col(str(i)) == "value").count() for i in range(150)]
Upvotes: 0
Views: 2020
Reputation: 1896
You can try the following map/reduce approach:
VALUE = 'value'

def row_mapper(df_row):
    # Map a row to a list of booleans: True where the cell equals VALUE
    return [each == VALUE for each in df_row]

def reduce_rows(df_row1, df_row2):
    # Element-wise sum of two rows; True counts as 1, accumulating per-column totals
    return [x + y for x, y in zip(df_row1, df_row2)]
Note: these are plain Python functions to illustrate the idea, not UDFs you can apply directly in PySpark.
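If it helps, here is a minimal sketch of how the two functions could be wired into Spark through the RDD API (assuming df is the 150-column DataFrame and its columns are in the order you want the counts in):

counts = (
    df.rdd
      .map(row_mapper)      # Row -> [bool, bool, ...], one flag per column
      .reduce(reduce_rows)  # element-wise sum -> [count_0, ..., count_149]
)

This makes a single pass over the data instead of one filter/count job per column.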
Upvotes: 0
Reputation: 42422
You can do a conditional count aggregation:
import pyspark.sql.functions as F

# One single-pass aggregation: count the matching rows in each column
df2 = df.agg(*[
    F.count(F.when(F.col(str(i)) == "value", 1)).alias(str(i))
    for i in range(150)
])
# df2 has a single row; transpose it and take that row's values as a list
result = df2.toPandas().transpose()[0].tolist()
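For illustration, a tiny self-contained run (the 3-column frame and its data here are made up) showing the shape of the result:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the 150-column DataFrame
df = spark.createDataFrame(
    [("value", "x", "value"), ("value", "value", "y")],
    ["0", "1", "2"],
)
df2 = df.agg(*[
    F.count(F.when(F.col(str(i)) == "value", 1)).alias(str(i))
    for i in range(3)
])
print(df2.toPandas().transpose()[0].tolist())  # [2, 1, 1]

Because everything happens in one agg, Spark scans the data once rather than 150 times.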
Upvotes: 1