Reputation: 5
I would like to clean up the data in a DataFrame column, City. It can have the following values:
Venice® VeniceÆ Venice? Venice Venice® Venice
I would like to remove all the non-ASCII characters, as well as ? and . How can I achieve this?
Upvotes: 0
Views: 509
Reputation: 1228
You can clean up the strings with a regular expression that keeps only ASCII letters:
# imports (assumes an active SparkSession named `spark`)
from pyspark.sql import functions as f

# create the DataFrame
data = [
    (1, "Venice®"),
    (2, "VeniceÆ"),
    (3, "Venice?"),
    (4, "Venice"),
]
schema = ["id", "name"]
df_raw = spark.createDataFrame(data=data, schema=schema)
df_raw.show()
+---+--------+
|id |name    |
+---+--------+
|1  |Venice® |
|2  |VeniceÆ |
|3  |Venice? |
|4  |Venice  |
+---+--------+
# apply the regular expression: drop every character that is not an ASCII letter
df_clean = df_raw.withColumn(
    "clean_name",
    f.regexp_replace(f.col("name"), "[^a-zA-Z]", ""),
)
df_clean.show()
+---+--------+----------+
| id|    name|clean_name|
+---+--------+----------+
|  1| Venice®|    Venice|
|  2| VeniceÆ|    Venice|
|  3| Venice?|    Venice|
|  4|  Venice|    Venice|
+---+--------+----------+
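Note that `[^a-zA-Z]` removes everything that is not an ASCII letter, including spaces, digits and hyphens. If city names can contain those, a narrower pattern may be a better fit. The sketch below uses plain Python `re` to compare the two patterns so it can be run without a Spark session. The pattern `[^\x00-\x7F]|[?.]`, the helper names and the sample "New York®" are illustrative assumptions, not part of the original answer.

```python
import re

# Pattern from the answer above: drops everything except ASCII letters.
LETTERS_ONLY = re.compile(r"[^a-zA-Z]")

# Hypothetical narrower pattern (an assumption, not from the original answer):
# drops non-ASCII characters plus the literal '?' and '.' only.
NON_ASCII_OR_PUNCT = re.compile(r"[^\x00-\x7F]|[?.]")


def strip_to_letters(value: str) -> str:
    """Keep ASCII letters only."""
    return LETTERS_ONLY.sub("", value)


def strip_non_ascii(value: str) -> str:
    """Remove non-ASCII characters, '?' and '.', and keep the rest."""
    return NON_ASCII_OR_PUNCT.sub("", value)


# "New York®" is a made-up sample to show the difference on a name with a space.
samples = ["Venice®", "VeniceÆ", "Venice?", "Venice", "New York®"]
for s in samples:
    print(repr(s), "->", repr(strip_to_letters(s)), "|", repr(strip_non_ascii(s)))
```

The same pattern should also work with Spark, since `[^\x00-\x7F]` is valid Java regex syntax: pass it as a raw string, for example `f.regexp_replace(f.col("name"), r"[^\x00-\x7F]|[?.]", "")`.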
PS: I doubt you would see such characters if the data were imported into Spark correctly. Superscripts, for example, are ignored.
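One possible reading of the PS is that stray characters can come from decoding the file with the wrong character set. The plain-Python sketch below illustrates that effect. It assumes a UTF-8 source that is mistakenly decoded as Latin-1, which may or may not match the asker's situation.

```python
# Assumption: the source file is UTF-8 but gets decoded as Latin-1 on import.
original = "Venice®"
raw_bytes = original.encode("utf-8")

# Wrong charset: the two UTF-8 bytes of '®' become two separate characters.
wrongly_decoded = raw_bytes.decode("latin-1")

# Right charset: the string round-trips unchanged.
correctly_decoded = raw_bytes.decode("utf-8")

print(repr(wrongly_decoded))
print(repr(correctly_decoded))
```

If this is the cause, setting the reader's charset explicitly may help more than cleaning afterwards, for example the CSV reader's `encoding` option: `spark.read.option("encoding", "UTF-8").csv(path)`, where `path` is a placeholder.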
Upvotes: 1