Nabih Bawazir

Reputation: 7255

Make a not available column in PySpark dataframe full of zero

I have a script which currently runs on a sample dataframe. Here's my code:

fd_orion_apps = fd_orion_apps.groupBy('msisdn', 'apps_id').pivot('apps_id').count().select('msisdn', *parameter_cut.columns).fillna(0)

After pivoting, some of the columns in parameter_cut may not exist in fd_orion_apps, and the select then fails with an error like this:

Py4JJavaError: An error occurred while calling o1381.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '`8602`' given input columns: [7537, 7011, 2658, 3582, 12120, 31049, 35010, 16615, 10003, 15067, 1914, 1436, 6032, 422, 10636, 10388, 877,...

Upvotes: 1

Views: 311

Answers (1)

ZygD

Reputation: 24386

You can separate the select into its own step. Then you can use a conditional expression inside a list comprehension: keep the column if the pivot produced it, otherwise substitute a literal zero aliased to that column name.

from pyspark.sql import functions as F

fd_orion_apps = fd_orion_apps.groupBy('msisdn', 'apps_id').pivot('apps_id').count()
fd_orion_apps = fd_orion_apps.select(
    'msisdn',
    *[c if c in fd_orion_apps.columns else F.lit(0).alias(c) for c in parameter_cut.columns]
)
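To see what the comprehension produces, here is a minimal sketch of the same selection logic in plain Python, with made-up column lists standing in for the real DataFrame columns (the names are taken from the question's error message for illustration only):

```python
# Columns the pivot actually produced (hypothetical sample).
pivoted_columns = ['msisdn', '7011', '7537']
# Columns we want in the result, i.e. parameter_cut.columns (hypothetical sample).
wanted_columns = ['7011', '7537', '8602']

# Keep the column name if the pivot produced it; otherwise plan a
# zero literal aliased to that name, mirroring F.lit(0).alias(c).
select_exprs = [
    c if c in pivoted_columns else f'lit(0).alias({c!r})'
    for c in wanted_columns
]
print(select_exprs)  # ['7011', '7537', "lit(0).alias('8602')"]
```

In the real PySpark version, the `else` branch returns an actual `F.lit(0).alias(c)` column expression rather than a string, so missing pivot columns come back filled with zeros instead of raising `AnalysisException`.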

Upvotes: 1
