Reputation: 133
How can I remove the leading zeros from one column in PySpark?
Upvotes: 1
Views: 26850
Reputation: 1669
Another way is to use regexp_replace:
from pyspark.sql import functions as F
df.show()
# NB: the pattern r'0' removes every '0' in the string, not just leading ones
df = df.withColumn('subcategory', F.regexp_replace('subcategory', r'0', ''))
df = df.withColumn('subcategory_label', F.regexp_replace('subcategory_label', r'0', ''))
df.show()
The input DataFrame:
+-----------+-----------------+--------+
|subcategory|subcategory_label|category|
+-----------+-----------------+--------+
| 00EEE| 00EEE FFF| Drink|
| 0000EEE| 00EEE FFF| Fruit|
| 0EEE| 000EEE FFF| Meat|
+-----------+-----------------+--------+
The output DataFrame:
+-----------+-----------------+--------+
|subcategory|subcategory_label|category|
+-----------+-----------------+--------+
| EEE| EEE FFF| Drink|
| EEE| EEE FFF| Fruit|
| EEE| EEE FFF| Meat|
+-----------+-----------------+--------+
If you only want to remove the 0s at the beginning of the strings, anchor the pattern at the start so that no 0s in the middle get removed:
df = df.withColumn('subcategory', F.regexp_replace('subcategory', r'^[0]*', ''))
df = df.withColumn('subcategory_label', F.regexp_replace('subcategory_label', r'^[0]*', ''))
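To make the difference between the two patterns concrete, here is a small sketch (assuming an active spark session; the value '00E0E' is a made-up example containing an interior zero):
from pyspark.sql import functions as F

# made-up value with an interior zero, to contrast the two patterns
demo = spark.createDataFrame([('00E0E',)], ['s'])
demo.select(
    F.regexp_replace('s', r'0', '').alias('all_zeros'),       # 'EE'
    F.regexp_replace('s', r'^0*', '').alias('leading_only'),  # 'E0E'
).show()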
Upvotes: 14
Reputation: 883
You can use lstrip('0') to get rid of leading 0's in a Python string. To do this in PySpark, wrap it in a UDF:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

strip_leading_zeros = F.udf(lambda x: x.lstrip('0'), StringType())
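Applied to the columns from the question (a minimal sketch, reusing the strip_leading_zeros UDF defined above):
df = df.withColumn('subcategory', strip_leading_zeros('subcategory'))
df = df.withColumn('subcategory_label', strip_leading_zeros('subcategory_label'))
Note that a Python UDF processes rows in a Python worker, so the built-in regexp_replace approach in the other answer will generally be faster. Also, the lambda receives None for NULL values and lstrip would then raise an AttributeError, so add a None check if the column is nullable.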
Upvotes: -1