DPatrick
DPatrick

Reputation: 59

How to create a new column based on if certain strings exist in another column?

I have a table looks like this:

+--------+-------------+
| Time   | Locations   |
+--------+-------------+
| 1/1/22 | A300-abc    |
+--------+-------------+
| 1/2/22 | A300-FFF    |
+--------+-------------+
| 1/3/22 | A300-ABC123 |
+--------+-------------+
| 1/4/22 | B700-abc    |
+--------+-------------+
| 1/5/22 | B750-EEE    |
+--------+-------------+
| 1/6/22 | M-200-68    |
+--------+-------------+
| 1/7/22 | ABC-abc     |
+--------+-------------+

I would like to derive to a table that looks like this:

+--------+-------------+-----------------+
| Time   | Locations   | Locations_Clean |
+--------+-------------+-----------------+
| 1/1/22 | A300-abc    | A300            |
+--------+-------------+-----------------+
| 1/2/22 | A300 FFF    | A300            |
+--------+-------------+-----------------+
| 1/3/22 | A300-ABC123 | A300            |
+--------+-------------+-----------------+
| 1/4/22 | B700-abc    | B700            |
+--------+-------------+-----------------+
| 1/5/22 | B750-EEE    | B750            |
+--------+-------------+-----------------+
| 1/6/22 | M-200-68    | M-200           |
+--------+-------------+-----------------+
| 1/7/22 | ABC-abc     | "not_listed"    |
+--------+-------------+-----------------+

Essentially I have a list of what the location code should be e.g. ["A300","B700","B750","M-200"], but currently the location column is very messy with other random strings. I want to create a new column that shows the "cleaned" version of the location code, and anything that is not in that list should be marked as "not_listed".

Upvotes: 0

Views: 387

Answers (1)

wwnde
wwnde

Reputation: 26686

Use regex and when condition. In this case I check if string begins with a digit ^[0-9] then extract the the leading digits in the string. If it doesn then attribute it with not listed. Code below

df=df.withColumn('Locations_Clean', when(col("Locations").rlike("^[0-9]"),regexp_extract('Locations','^[0-9]+',0)).otherwise(lit('not_listed'))).show()

+--------------------+---------+---------------+
|                Time|Locations|Locations_Clean|
+--------------------+---------+---------------+
|0.045454545454545456|   300abc|            300|
|0.022727272727272728|   300FFF|            300|
| 0.01515151515151515|   300ABC|            300|
|0.011363636363636364|   700abc|            700|
|0.009090909090909092|   750EEE|            750|
|0.007575757575757575|   ABCabc|     not_listed|
+--------------------+---------+---------------+

With your new question, use regexp_replace

df=df.withColumn('Locations_Clean', when(col("Locations").rlike("\d"),regexp_replace('Locations','\-\w+$','')).otherwise(lit('not_listed')))

+------+-----------+---------------+
|  Time|  Locations|Locations_Clean|
+------+-----------+---------------+
|1/1/22|   A300-abc|           A300|
|1/2/22|   A300-FFF|           A300|
|1/3/22|A300-ABC123|           A300|
|1/4/22|   B700-abc|           B700|
|1/5/22|   B750-EEE|           B750|
|1/7/22|   M-200-68|          M-200|
|1/6/22|     ABCabc|     not_listed|
+------+-----------+---------------+

Upvotes: 1

Related Questions