Baobab

Reputation: 5

Remove non-ASCII and specific characters from a dataframe column using Pyspark

I would like to clean up data in a dataframe column City. It can have the following values:

Venice® VeniceÆ Venice? Venice Venice® Venice

I would like to remove all the non-ASCII characters as well as ?. How can I achieve it?

Upvotes: 0

Views: 509

Answers (1)

Alex Ortner

Reputation: 1228

You can clean up the strings with a regex that keeps only letters:

# create dataframe (assumes an active SparkSession named `spark`)
import pyspark.sql.functions as f

date_data = [
    (1, "Venice®"),
    (2, "VeniceÆ"),
    (3, "Venice?"),
    (4, "Venice")]

schema = ["id", "name"]
df_raw = spark.createDataFrame(data=date_data, schema=schema)
df_raw.show()

+---+--------+
|id |name    |
+---+--------+
|1  |Venice® |
|2  |VeniceÆ |
|3  |Venice? |
|4  |Venice  |
+---+--------+

# apply regular expression: drop every character that is not a letter
df_clean = df_raw.withColumn("clean_name", f.regexp_replace(f.col("name"), "[^a-zA-Z]", ""))
df_clean.show()

+---+--------+----------+
| id|    name|clean_name|
+---+--------+----------+
|  1| Venice®|
|  2| VeniceÆ|    Venice|
|  3| Venice?|    Venice|
|  4|  Venice|    Venice|
+---+--------+----------+
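Note that `[^a-zA-Z]` also strips spaces and digits, which matters for multi-word city names like "Venice Beach". If the goal is only to drop non-ASCII characters plus `?`, a narrower pattern such as `[^\x00-\x7F]|[?]` can be used instead. A minimal sketch using Python's `re` module to illustrate the pattern (the `clean` helper is hypothetical; the same pattern string works with `f.regexp_replace` since Spark uses Java-style regexes that accept `\xHH` escapes):

```python
import re

# Pattern: any non-ASCII character, or a literal "?".
# Assumption: spaces and digits should be kept, unlike with "[^a-zA-Z]".
pattern = r"[^\x00-\x7F]|[?]"

def clean(text: str) -> str:
    """Remove non-ASCII characters and '?' from a string."""
    return re.sub(pattern, "", text)

print(clean("Venice®"))          # -> Venice
print(clean("VeniceÆ"))          # -> Venice
print(clean("Venice? Beach 2"))  # -> Venice Beach 2  (space and digit kept)
```

In Spark this would be `f.regexp_replace(f.col("name"), r"[^\x00-\x7F]|[?]", "")`.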

PS: I doubt you will see such characters after a correct import into Spark; superscript characters, for example, are ignored.

Upvotes: 1
