Manz
Manz

Reputation: 605

Pandas: How to remove numbers and special characters from a column

I am trying to remove all characters except alpha and spaces from a column, but when i am using the code to perform the same, it gives output as 'nan' in place of NaN (Null values)

Input data:

col1

ABC ad
YQW \2
AQ4 GH
@34
#45
NaN

Expected output:

col1

ABC ad
YQW
AQ GH
NaN
NaN
NaN

Code i have been using:

df['col1'] = df['col1'].astype(str).str.extract(r'([A-Za-z]+(?: [A-Za-z]+)*)')

Later i am using this column to check the condition for NaN but its not giving as after executing the above script it changes the NaN values to 'nan'.

Note: without casting to string by .astype(str), my data will get

AttributeError: Can only use .str accessor with string values!

Upvotes: 1

Views: 8164

Answers (2)

SeaBean
SeaBean

Reputation: 23217

You can do it by the following steps:

  1. Firstly, replace NaN value by empty string (which we may also get after removing characters and will be converted back to NaN afterwards).
  2. Cast the column to string type by .astype(str) for in case some elements are non-strings in the column.
  3. Replace non alpha and non blank to empty string by str.replace() with regex
  4. Finally, replace empty string to NaN by .replace()

(Note: The first 2 steps are to special handle for OP's problem of getting AttributeError: Can only use .str accessor with string values! although my testing of specially adding integer and float (not integer and float in string but real numeric values) also got no problem without the first 2 steps. Maybe some other special data types!?) Other users without the same problem can use only the last 2 steps starting with str.replace().


df['col1'] = df['col1'].fillna('').astype(str).str.replace(r'[^A-Za-z ]', '', regex=True).replace('', np.nan, regex=False)

Result:

print(df)


     col1
0  ABC ad
1    YQW 
2   AQ GH
3     NaN
4     NaN
5     NaN


Note that we cannot use .extract() here and have to use .replace() to get rid of the unwanted characters. How about a string like ' ab c1d2@ ef4' ? What regex pattern to use to extract only the alphabets and spaces leaving behind the numbers and special characters ? And don't forget we have to consider the generic cases, not just the sample data here. Can we quote all possible patterns of regex here to handle the infinite numbers of combinations of such alpha, space, number and special character patterns ?

Upvotes: 0

wwnde
wwnde

Reputation: 26676

Another way is to extract alphanumerics but exclude numerals. See code below

df['col1']=df['col1'].str.extract('(\w+\s\w+[^0-9]|\w+[^0-9])')

    col1
0  ABC ad
1    YQW 
2  AQ4 GH
3     NaN
4     NaN
5     NaN

Upvotes: 1

Related Questions