Reputation: 59
I have a table looks like this:
+--------+-------------+
| Time | Locations |
+--------+-------------+
| 1/1/22 | A300-abc |
+--------+-------------+
| 1/2/22 | A300-FFF |
+--------+-------------+
| 1/3/22 | A300-ABC123 |
+--------+-------------+
| 1/4/22 | B700-abc |
+--------+-------------+
| 1/5/22 | B750-EEE |
+--------+-------------+
| 1/6/22 | M-200-68 |
+--------+-------------+
| 1/7/22 | ABC-abc |
+--------+-------------+
I would like to derive to a table that looks like this:
+--------+-------------+-----------------+
| Time | Locations | Locations_Clean |
+--------+-------------+-----------------+
| 1/1/22 | A300-abc | A300 |
+--------+-------------+-----------------+
| 1/2/22 | A300 FFF | A300 |
+--------+-------------+-----------------+
| 1/3/22 | A300-ABC123 | A300 |
+--------+-------------+-----------------+
| 1/4/22 | B700-abc | B700 |
+--------+-------------+-----------------+
| 1/5/22 | B750-EEE | B750 |
+--------+-------------+-----------------+
| 1/6/22 | M-200-68 | M-200 |
+--------+-------------+-----------------+
| 1/7/22 | ABC-abc | "not_listed" |
+--------+-------------+-----------------+
Essentially I have a list of what the location code should be e.g. ["A300","B700","B750","M-200"], but currently the location column is very messy with other random strings. I want to create a new column that shows the "cleaned" version of the location code, and anything that is not in that list should be marked as "not_listed".
Upvotes: 0
Views: 387
Reputation: 26686
Use regex and when condition. In this case I check if string begins with a digit ^[0-9]
then extract the the leading digits in the string. If it doesn then attribute it with not listed. Code below
df=df.withColumn('Locations_Clean', when(col("Locations").rlike("^[0-9]"),regexp_extract('Locations','^[0-9]+',0)).otherwise(lit('not_listed'))).show()
+--------------------+---------+---------------+
| Time|Locations|Locations_Clean|
+--------------------+---------+---------------+
|0.045454545454545456| 300abc| 300|
|0.022727272727272728| 300FFF| 300|
| 0.01515151515151515| 300ABC| 300|
|0.011363636363636364| 700abc| 700|
|0.009090909090909092| 750EEE| 750|
|0.007575757575757575| ABCabc| not_listed|
+--------------------+---------+---------------+
With your new question, use regexp_replace
df=df.withColumn('Locations_Clean', when(col("Locations").rlike("\d"),regexp_replace('Locations','\-\w+$','')).otherwise(lit('not_listed')))
+------+-----------+---------------+
| Time| Locations|Locations_Clean|
+------+-----------+---------------+
|1/1/22| A300-abc| A300|
|1/2/22| A300-FFF| A300|
|1/3/22|A300-ABC123| A300|
|1/4/22| B700-abc| B700|
|1/5/22| B750-EEE| B750|
|1/7/22| M-200-68| M-200|
|1/6/22| ABCabc| not_listed|
+------+-----------+---------------+
Upvotes: 1