PyNoob
PyNoob

Reputation: 223

What is python regex for removing all text in a column?

I am trying to clean a column:

df:
+-----+------------------+--------------------+--------------------+--------------+--------------+
|     | league           | home_team          | away_team          | home_score   | away_score   |
+=====+==================+====================+====================+==============+==============+
|   0 | Champions League | APOEL              | Qarabag            | 1            | 2            |
+-----+------------------+--------------------+--------------------+--------------+--------------+
|   1 | Champions League | FC Copenhagen      | TNS                | 1            | 0            |
+-----+------------------+--------------------+--------------------+--------------+--------------+
|   2 | Champions League | AIK                | Maribor            | 3            | 2 ET         |
+-----+------------------+--------------------+--------------------+--------------+--------------+

expected

df:
+-----+------------------+--------------------+--------------------+--------------+--------------+
|     | league           | home_team          | away_team          | home_score   | away_score   |
+=====+==================+====================+====================+==============+==============+
|   0 | Champions League | APOEL              | Qarabag            | 1            | 2            |
+-----+------------------+--------------------+--------------------+--------------+--------------+
|   1 | Champions League | FC Copenhagen      | TNS                | 1            | 0            |
+-----+------------------+--------------------+--------------------+--------------+--------------+
|   2 | Champions League | AIK                | Maribor            | 3            | 2            |
+-----+------------------+--------------------+--------------------+--------------+--------------+

I am trying

df['away_score'] = df['away_score'].astype(str).str.replace('(\s?\w+)$', '', regex=True)

(works on regex101 but not in pandas)

But all the data in column is being replaced.

+-----+------------------+--------------------+--------------------+--------------+--------------+
|     | league           | home_team          | away_team          | home_score   | away_score   |
+=====+==================+====================+====================+==============+==============+
|   0 | Champions League | APOEL              | Qarabag            | 1            |              |
+-----+------------------+--------------------+--------------------+--------------+--------------+
|   1 | Champions League | FC Copenhagen      | TNS                | 1            |              |
+-----+------------------+--------------------+--------------------+--------------+--------------+
|   2 | Champions League | AIK                | Maribor            | 3            | 2            |
+-----+------------------+--------------------+--------------------+--------------+--------------+

What should be the correct regex?

Upvotes: 1

Views: 70

Answers (2)

PyNoob
PyNoob

Reputation: 223

I tried this regex, and it worked.

df['away_score'] = df['away_score'].astype(str).str.replace('[a-zA-Z]', '', regex=True)

Upvotes: 2

SeaBean
SeaBean

Reputation: 23217

To clean up the text completely (including space), you should use:

df['away_score'] = df['away_score'].astype(str).str.replace('[a-zA-Z\s]', '', regex=True)

This way, you can also clean up the spaces before the alphabets, e.g. the space before ET in ET.

If you want to clean up not only text but also some non-digit including symbols (leaving only digits), you can use:

df['away_score'] = df['away_score'].astype(str).str.replace('\D', '', regex=True)

Upvotes: 1

Related Questions