Reputation: 7265
I am trying to fill in missing categorical data in PySpark. I need to map every '000' and null value to 'UNDEFINED', but my attempt turns every other value into null instead. Here is my input:
lastvalue_month.select('province').show(50)
+--------------------+
| province|
+--------------------+
| DKI JAKARTA|
| DKI JAKARTA|
| JAWA BARAT|
| JAWA BARAT|
| JAWA BARAT|
| BANTEN|
| BANTEN|
| BALI|
| BALI|
| 000|
| DKI JAKARTA|
| JAWA BARAT|
| JAWA BARAT|
|DAERAH ISTIMEWA Y...|
| 000|
| JAWA BARAT|
| JAWA BARAT|
| KEPULAUAN RIAU|
| SUMATERA UTARA|
| SUMATERA UTARA|
| JAWA BARAT|
| JAWA BARAT|
| BANTEN|
| JAWA BARAT|
| JAWA BARAT|
| JAWA BARAT|
| JAWA BARAT|
| 000|
| DKI JAKARTA|
| BANTEN|
| JAWA BARAT|
| 000|
| DKI JAKARTA|
| DKI JAKARTA|
| BANTEN|
| DKI JAKARTA|
| DKI JAKARTA|
| JAWA TENGAH|
| JAWA BARAT|
| BANTEN|
| DKI JAKARTA|
|DAERAH ISTIMEWA Y...|
| BANTEN|
| JAWA BARAT|
| DKI JAKARTA|
| DKI JAKARTA|
| JAWA BARAT|
| JAWA BARAT|
| JAWA TIMUR|
| DKI JAKARTA|
+--------------------+
What I did
lastvalue_month = lastvalue_month.withColumn('province', when((col('province') == '000') | col('province').isNull(), lit('UNDEFINED')))
My output
+---------+
| province|
+---------+
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
|UNDEFINED|
| null|
| null|
| null|
| null|
|UNDEFINED|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
|UNDEFINED|
| null|
| null|
| null|
|UNDEFINED|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
+---------+
All the values that were already filled in have become null. What I need is for only the '000' and null values to become 'UNDEFINED', with everything else left unchanged. How can I do this in PySpark?
Upvotes: 0
Views: 334
Reputation: 5052
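You also need an otherwise clause to assign a value to rows that do not satisfy the when condition. A when without otherwise returns null for every row that does not match, which is why your other provinces were wiped out.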
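from pyspark.sql.functions import col, lit, when

# Replace '000' and null with 'UNDEFINED'; keep every other value as it is.
lastvalue_month = lastvalue_month.withColumn(
    'province',
    when((col('province') == '000') | col('province').isNull(), lit('UNDEFINED'))
    .otherwise(col('province'))
)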
Upvotes: 1