Michail N

Reputation: 3855

PySpark DataFrame: get a list of columns where at least one row meets a condition

I have a PySpark DataFrame

Col1 Col2 Col3
0.1  0.2  0.3

I want to get the column names where at least one row meets a condition, for example, where a value is greater than 0.1.

My expected result in this case should be:

[Col2, Col3]

I cannot provide any code because, honestly, I don't know how to approach this.

Upvotes: 3

Views: 2704

Answers (1)

zero323

Reputation: 330343

Just count the items satisfying the predicate (the inner select) and process the result:

from pyspark.sql.functions import col, count, when

[c for c, v in df.select([
    count(when(col(c) > 0.1, 1)).alias(c) for c in df.columns
]).first().asDict().items() if v]
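
For reference, a minimal end-to-end sketch of the same one-liner, assuming a local SparkSession and the column names from the question:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, count, when

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(0.1, 0.2, 0.3)], ["Col1", "Col2", "Col3"])

    # Count, per column, the rows whose value exceeds 0.1, then keep
    # the names of the columns with a non-zero count.
    result = [c for c, v in df.select([
        count(when(col(c) > 0.1, 1)).alias(c) for c in df.columns
    ]).first().asDict().items() if v]

    print(result)  # ['Col2', 'Col3']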

Step by step:

  • Aggregate (DataFrame -> DataFrame):

    df = sc.parallelize([(0.1, 0.2, 0.3)]).toDF()  # toDF() assigns default names _1, _2, _3
    counts = df.select([
        # one count per column: rows where the value exceeds 0.1
        count(when(col(c) > 0.1, 1)).alias(c) for c in df.columns
    ])
    
    DataFrame[_1: bigint, _2: bigint, _3: bigint]
    
  • collect the first Row:

    a_row = counts.first()
    
    Row(_1=0, _2=1, _3=1)
    
  • Convert to Python dict:

    a_dict = a_row.asDict()
    
    {'_1': 0, '_2': 1, '_3': 1}
    
  • And iterate over its items, keeping a key when its value is truthy:

    [c for c, v in a_dict.items() if v]
    

    or checking the count explicitly:

    [c for c, v in a_dict.items() if v > 0]
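
As a side note (not from the original answer), the same test can be phrased with max over a 0/1 flag instead of count. A minimal sketch, reusing the df from the aggregate step above (default column names _1, _2, _3):

    from pyspark.sql.functions import col, when
    from pyspark.sql.functions import max as max_  # aliased to avoid shadowing Python's max

    # The per-column max of the flag is 1 iff at least one row
    # satisfies the predicate, mirroring a non-zero count.
    flags = df.select([
        max_(when(col(c) > 0.1, 1).otherwise(0)).alias(c) for c in df.columns
    ]).first().asDict()

    [c for c, v in flags.items() if v]  # ['_2', '_3']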
    

Upvotes: 7
