Remove leading zeros pyspark?

I want to remove the leading zeros from one column in PySpark. How can I do this?


Upvotes: 1

Views: 26850

Answers (2)

niuer

Reputation: 1669

Another way is to use regexp_replace here:

from pyspark.sql import functions as F
df.show()
df = df.withColumn('subcategory', F.regexp_replace('subcategory', r'0', ''))
df = df.withColumn('subcategory_label', F.regexp_replace('subcategory_label', r'0', ''))
df.show()

The input DataFrame:

+-----------+-----------------+--------+
|subcategory|subcategory_label|category|
+-----------+-----------------+--------+
|      00EEE|        00EEE FFF|   Drink|
|    0000EEE|        00EEE FFF|   Fruit|
|       0EEE|       000EEE FFF|    Meat|
+-----------+-----------------+--------+

The output DataFrame:

+-----------+-----------------+--------+
|subcategory|subcategory_label|category|
+-----------+-----------------+--------+
|        EEE|          EEE FFF|   Drink|
|        EEE|          EEE FFF|   Fruit|
|        EEE|          EEE FFF|    Meat|
+-----------+-----------------+--------+

If the 0s should only be removed from the beginning of the strings, anchor the pattern instead, so that no interior 0 gets removed:

df = df.withColumn('subcategory', F.regexp_replace('subcategory', r'^[0]*', ''))
df = df.withColumn('subcategory_label', F.regexp_replace('subcategory_label', r'^[0]*', ''))
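The difference between the two patterns can be checked in plain Python, since `re.sub` uses the same regex semantics as Spark's `regexp_replace` for these patterns (the values below are illustrative, not from the question):

```python
import re

values = ["00EEE", "0000EEE", "0EEE", "E0E"]

# r'0' removes every zero, including ones in the middle of the string:
print([re.sub(r'0', '', v) for v in values])    # ['EEE', 'EEE', 'EEE', 'EE']

# r'^0*' removes only the zeros anchored at the start:
print([re.sub(r'^0*', '', v) for v in values])  # ['EEE', 'EEE', 'EEE', 'E0E']
```

Note how `"E0E"` loses its interior zero under the first pattern but is left intact by the anchored one.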

Upvotes: 14

vishalv2050

Reputation: 883

You can use lstrip('0') to get rid of leading 0's in a string. To do this in PySpark, wrap it in a UDF:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

strip_zeros = F.udf(lambda x: x.lstrip('0'), StringType())
df = df.withColumn('subcategory', strip_zeros('subcategory'))
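The `lstrip('0')` behavior itself is easy to verify in plain Python (illustrative values, not from the question):

```python
# str.lstrip('0') removes all leading '0' characters but leaves
# interior and trailing zeros alone.
print("0000EEE".lstrip('0'))  # EEE
print("0EE0E0".lstrip('0'))   # EE0E0
print("EEE".lstrip('0'))      # EEE (no leading zeros, unchanged)
```

One caveat: a Python UDF like this will raise an error on null values (the lambda receives `None`) and is generally slower than the built-in `regexp_replace` approach from the other answer, so the built-in is usually preferable.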

Upvotes: -1
