Reputation: 223
I am trying to clean a column:
df:
+-----+------------------+--------------------+--------------------+--------------+--------------+
| | league | home_team | away_team | home_score | away_score |
+=====+==================+====================+====================+==============+==============+
| 0 | Champions League | APOEL | Qarabag | 1 | 2 |
+-----+------------------+--------------------+--------------------+--------------+--------------+
| 1 | Champions League | FC Copenhagen | TNS | 1 | 0 |
+-----+------------------+--------------------+--------------------+--------------+--------------+
| 2 | Champions League | AIK | Maribor | 3 | 2 ET |
+-----+------------------+--------------------+--------------------+--------------+--------------+
expected
df:
+-----+------------------+--------------------+--------------------+--------------+--------------+
| | league | home_team | away_team | home_score | away_score |
+=====+==================+====================+====================+==============+==============+
| 0 | Champions League | APOEL | Qarabag | 1 | 2 |
+-----+------------------+--------------------+--------------------+--------------+--------------+
| 1 | Champions League | FC Copenhagen | TNS | 1 | 0 |
+-----+------------------+--------------------+--------------------+--------------+--------------+
| 2 | Champions League | AIK | Maribor | 3 | 2 |
+-----+------------------+--------------------+--------------------+--------------+--------------+
I am trying
df['away_score'] = df['away_score'].astype(str).str.replace('(\s?\w+)$', '', regex=True)
(works on regex101 but not in pandas)
But all the data in column is being replaced.
+-----+------------------+--------------------+--------------------+--------------+--------------+
| | league | home_team | away_team | home_score | away_score |
+=====+==================+====================+====================+==============+==============+
| 0 | Champions League | APOEL | Qarabag | 1 | |
+-----+------------------+--------------------+--------------------+--------------+--------------+
| 1 | Champions League | FC Copenhagen | TNS | 1 | |
+-----+------------------+--------------------+--------------------+--------------+--------------+
| 2 | Champions League | AIK | Maribor | 3 | 2 |
+-----+------------------+--------------------+--------------------+--------------+--------------+
What should be the correct regex?
Upvotes: 1
Views: 70
Reputation: 223
I tried this regex, and it worked.
df['away_score'] = df['away_score'].astype(str).str.replace('[a-zA-Z]', '', regex=True)
Upvotes: 2
Reputation: 23217
To clean up the text completely (including space), you should use:
df['away_score'] = df['away_score'].astype(str).str.replace('[a-zA-Z\s]', '', regex=True)
This way, you can also clean up the spaces before the alphabets, e.g. the space before ET
in ET
.
If you want to clean up not only text but also some non-digit including symbols (leaving only digits), you can use:
df['away_score'] = df['away_score'].astype(str).str.replace('\D', '', regex=True)
Upvotes: 1