user3046211
user3046211

Reputation: 696

Removing unwanted characters in python pandas

I have a pandas dataframe column like below :

  |  ColumnA    |
  +-------------+
  |    ABCD(!)  |
  | <DEFG>(23)  |
  | (MNPQ. )    |
  |    32.JHGF  |
  |    "QWERT"  |

Aim is to remove the special characters and produce the output as below :

  |  ColumnA   |
  +------------+
  |    ABCD    |
  |    DEFG    |
  |    MNPQ    |
  |    JHGF    |
  |    QWERT   |

Tried using the replace method like below, but without success :

df['ColumnA'] = df['ColumnA'].str.replace(r"[^a-zA-Z\d\_]+", "", regex=True)
print(df)

So, how can I replace the special characters using replace method in pandas?

Upvotes: 0

Views: 271

Answers (3)

Slybot
Slybot

Reputation: 598

Your suggested code works fine on my installation with only extra digits so that you need to update your regex statement: r"[^a-zA-Z]+" If this doesn't work, then maybe try to update your pandas;

import pandas as pd

d = {'Column A': [' ABCD(!)', '<DEFG>(23)', '(MNPQ. )', ' 32.JHGF',  '"QWERT"']}
df = pd.DataFrame(d)
df['ColumnA'] = df['ColumnA'].str.replace(r"[^a-zA-Z]+", "", regex=True)
print(df)

Output

  Column A
0     ABCD
1     DEFG
2     MNPQ
3     JHGF
4    QWERT

Upvotes: 0

Jerry An
Jerry An

Reputation: 1432

regrex should be r'[^a-zA-Z]+', it means keep only the characters that are from A to Z, a-z

import pandas as pd

#   |  ColumnA    |
#   +-------------+
#   |    ABCD(!)  |
#   | <DEFG>(23)  |
#   | (MNPQ. )    |
#   |    32.JHGF  |
#   |    "QWERT"  |

# create a dataframe from a list 
df = pd.DataFrame(['ABCD(!)', 'DEFG(23)', '(MNPQ. )', '32.JHGF', 'QWERT'], columns=['ColumnA'])

 #   |  ColumnA   |
#   +------------+
#   |    ABCD    |
#   |    DEFG    |
#   |    MNPQ    |
#   |    JHGF    |
#   |    QWERT   |
# keep only the characters that are from A to Z, a-z 
df['ColumnB'] =df['ColumnA'].str.replace(r'[^a-zA-Z]+', '')

print(df['ColumnB'])

Result:

0     ABCD
1     DEFG
2     MNPQ
3     JHGF
4    QWERT

Upvotes: 1

jezrael
jezrael

Reputation: 863031

Your solution is also for get numbers \d and _, so it remove only:

df['ColumnA'] = df['ColumnA'].str.replace(r"[^a-zA-Z]+", "")
print (df)
  ColumnA
0    ABCD
1    DEFG
2    MNPQ
3    JHGF
4   QWERT

Upvotes: 1

Related Questions