Reputation: 3855
I have a PySpark DataFrame:
Col1 Col2 Col3
0.1 0.2 0.3
I want to get the names of the columns where at least one row meets a condition, for example where a value is greater than 0.1.
My expected result in this case should be:
[Col2, Col3]
I cannot provide any code because, truthfully, I don't know how to do this.
Upvotes: 3
Views: 2704
Reputation: 330343
Just count the items satisfying the predicate (with an inner select) and process the results:
from pyspark.sql.functions import col, count, when
[c for c, v in df.select([
    count(when(col(c) > 0.1, 1)).alias(c) for c in df.columns
]).first().asDict().items() if v]
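Applied to the DataFrame from the question, a minimal end-to-end sketch (assuming a SparkSession bound to the name spark; the column names Col1, Col2, Col3 come from the question):
from pyspark.sql.functions import col, count, when

df = spark.createDataFrame([(0.1, 0.2, 0.3)], ["Col1", "Col2", "Col3"])

# keep the columns whose count of rows matching the predicate is non-zero
[c for c, v in df.select([
    count(when(col(c) > 0.1, 1)).alias(c) for c in df.columns
]).first().asDict().items() if v]
# ['Col2', 'Col3']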
Step by step:
Aggregate (DataFrame -> DataFrame); count(when(...)) counts only the rows where the condition holds, because when without otherwise yields NULL and count ignores NULLs:
df = sc.parallelize([(0.1, 0.2, 0.3)]).toDF()
counts = df.select([
    count(when(col(c) > 0.1, 1)).alias(c) for c in df.columns
])
DataFrame[_1: bigint, _2: bigint, _3: bigint]
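The repr above only shows the types; to see the per-column counts themselves, counts.show() would print something like:
+---+---+---+
| _1| _2| _3|
+---+---+---+
|  0|  1|  1|
+---+---+---+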
Collect the first Row:
a_row = counts.first()
Row(_1=0, _2=1, _3=1)
Convert to a Python dict:
a_dict = a_row.asDict()
{'_1': 0, '_2': 1, '_3': 1}
And iterate over its items, keeping the keys whose values are truthy:
[c for c, v in a_dict.items() if v]
or explicitly checking the count:
[c for c, v in a_dict.items() if v > 0]
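Either way, the result for this example is the list of columns with a non-zero count:
['_2', '_3']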
Upvotes: 7