Reputation: 882
I have a dataframe with columns containing string values.
col1|col2
---------
aaa |bbb
ccc |ddd
aaa |ddd
eee |fff
I have to get the number of allowed values ({aaa, ddd}) present in each column.
import pyspark.sql.functions as F
cond = "`col1` = 'aaa' OR `col1` = 'ddd'"
dataframe.where(F.expr(cond)).count()
This gives the required count for one column. We loop through all the columns and run the same operation on each of them, roughly as in the sketch below.
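A minimal sketch of that loop, assuming the allowed values live in a list called ok_values (the names here are just illustrative):

import pyspark.sql.functions as F

ok_values = ['aaa', 'ddd']
counts = {}
for c in dataframe.columns:
    # one filter-and-count Spark job per column, so 2000 columns means 2000 jobs
    cond = " OR ".join(f"`{c}` = '{v}'" for v in ok_values)
    counts[c] = dataframe.where(F.expr(cond)).count()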
This approach takes hours once the number of columns grows to around 2000.
Is there a better and faster approach that processes all columns in parallel?
Upvotes: 1
Views: 674
Reputation: 9247
One alternative is to use a list comprehension in Python to apply the same condition to all columns of the dataframe:
import pyspark.sql.functions as F
ok_values = ['aaa', 'ddd']
dataframe = dataframe.select(
    # cast each membership test to 0/1 and sum it, yielding one count per column in a single pass
    *[F.sum(F.col(c).isin(ok_values).cast('integer')).alias(c) for c in dataframe.columns]
)
dataframe.show()
+----+----+
|col1|col2|
+----+----+
| 2| 2|
+----+----+
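If you need the counts back in Python rather than as a one-row dataframe, collecting that single row should work (first() returns a Row, and asDict is a standard Row method; the variable name counts is just illustrative):

counts = dataframe.first().asDict()
# {'col1': 2, 'col2': 2}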
Upvotes: 1