Nabih Bawazir

Reputation: 7265

Filling various missing categorical data in pyspark

I am trying to fill missing categorical data in PySpark. I need to map every '000' and null value to 'UNDEFINED', but my attempt converts all the other values to null instead. Here is my input:

lastvalue_month.select('province').show(50)

+--------------------+
|            province|
+--------------------+
|         DKI JAKARTA|
|         DKI JAKARTA|
|          JAWA BARAT|
|          JAWA BARAT|
|          JAWA BARAT|
|              BANTEN|
|              BANTEN|
|                BALI|
|                BALI|
|                 000|
|         DKI JAKARTA|
|          JAWA BARAT|
|          JAWA BARAT|
|DAERAH ISTIMEWA Y...|
|                 000|
|          JAWA BARAT|
|          JAWA BARAT|
|      KEPULAUAN RIAU|
|      SUMATERA UTARA|
|      SUMATERA UTARA|
|          JAWA BARAT|
|          JAWA BARAT|
|              BANTEN|
|          JAWA BARAT|
|          JAWA BARAT|
|          JAWA BARAT|
|          JAWA BARAT|
|                 000|
|         DKI JAKARTA|
|              BANTEN|
|          JAWA BARAT|
|                 000|
|         DKI JAKARTA|
|         DKI JAKARTA|
|              BANTEN|
|         DKI JAKARTA|
|         DKI JAKARTA|
|         JAWA TENGAH|
|          JAWA BARAT|
|              BANTEN|
|         DKI JAKARTA|
|DAERAH ISTIMEWA Y...|
|              BANTEN|
|          JAWA BARAT|
|         DKI JAKARTA|
|         DKI JAKARTA|
|          JAWA BARAT|
|          JAWA BARAT|
|          JAWA TIMUR|
|         DKI JAKARTA|
+--------------------+

What I did

lastvalue_month = lastvalue_month.withColumn('province', when((col('province') == '000') | col('province').isNull(), lit('UNDEFINED')))

My output

+---------+
| province|
+---------+
|     null|
|     null|
|     null|
|     null|
|     null|
|     null|
|     null|
|     null|
|     null|
|UNDEFINED|
|     null|
|     null|
|     null|
|     null|
|UNDEFINED|
|     null|
|     null|
|     null|
|     null|
|     null|
|     null|
|     null|
|     null|
|     null|
|     null|
|     null|
|     null|
|UNDEFINED|
|     null|
|     null|
|     null|
|UNDEFINED|
|     null|
|     null|
|     null|
|     null|
|     null|
|     null|
|     null|
|     null|
|     null|
|     null|
|     null|
|     null|
|     null|
|     null|
|     null|
|     null|
|     null|
|     null|
+---------+

All the values that were already filled become null. What I need is for only '000' and null to become 'UNDEFINED', with every other value left unchanged. How can I do this in PySpark?

Upvotes: 0

Views: 334

Answers (1)

Vaebhav

Reputation: 5052

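You also need an otherwise clause to assign a value to the rows that do not satisfy the when condition. Without it, when returns null for every row the condition does not match, which is why all the other provinces became null.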

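from pyspark.sql.functions import col, lit, when

# Replace '000' and null with 'UNDEFINED'; otherwise() keeps every other value.
lastvalue_month = lastvalue_month.withColumn(
    'province',
    when((col('province') == '000') | col('province').isNull(), lit('UNDEFINED'))
    .otherwise(col('province'))
)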

Upvotes: 1
