slysid

Reputation: 5498

PySpark - When Otherwise - Condition should be a Column

I have a dataframe as below

root
 |-- tasin: string (nullable = true)
 |-- advertiser_id: decimal(38,10) (nullable = true)
 |-- predicted_sp_sold_units: decimal(38,10) (nullable = true)
 |-- predicted_sp_impressions: decimal(38,10) (nullable = true)
 |-- predicted_sp_clicks: decimal(38,10) (nullable = true)
 |-- predicted_sdc_sold_units: decimal(38,10) (nullable = true)
 |-- predicted_sdc_impressions: decimal(38,10) (nullable = true)
 |-- predicted_sdc_clicks: decimal(38,10) (nullable = true)
 |-- predicted_sda_sold_units: decimal(38,10) (nullable = true)
 |-- predicted_sda_impressions: decimal(38,10) (nullable = true)
 |-- predicted_sda_clicks: decimal(38,10) (nullable = true)
 |-- region_id: integer (nullable = true)
 |-- marketplace_id: integer (nullable = true)
 |-- dataset_date: date (nullable = true)

Now I am using the select statement below. I want to check for the presence of a column name and, if it is present, select its value; otherwise fill it with Null. The dataframe is stored in the variable df.

scores_df1 = df.select(
            col('marketplace_id'),
            col('region_id'),
            col('tasin'),
            col('advertiser_id'),
            col('predicted_sp_sold_units'),
            col('predicted_sp_impressions'),
            col('predicted_sp_clicks'),
            col('predicted_sdc_sold_units'),
            col('predicted_sdc_impressions'),
            col('predicted_sdc_clicks'),
            col('predicted_sda_sold_units'),
            col('predicted_sda_impressions'),
            col('predicted_sda_clicks'),
            when('sdcr_score' in df.columns is True, col('sdcr_score')).otherwise(lit(None)).alias('sdcr_score'),
            when('sdar_score' in df.columns is True, col('sdar_score')).otherwise(lit(None)).alias('sdar_score')
    )

I am receiving the error <class 'TypeError'>: condition should be a Column

Please advise what is wrong.

Upvotes: 0

Views: 1809

Answers (1)

walking

Reputation: 960

The expression 'sdcr_score' in df.columns is True is evaluated in Python before anything is sent to Spark, and it returns a plain Python True/False. So what you are actually passing to Spark is when(True, ...).

when() expects its first argument to be a Column that evaluates to true/false, not a Python bool.

You can wrap the argument in the lit() function, which turns the Python bool into a constant boolean Column that when() accepts as its condition.

Upvotes: 1
