Reputation: 605
I am trying to remove all characters except alpha and spaces from a column, but when i am using the code to perform the same, it gives output as 'nan'
in place of NaN
(Null values)
Input data:
col1
ABC ad
YQW \2
AQ4 GH
@34
#45
NaN
Expected output:
col1
ABC ad
YQW
AQ GH
NaN
NaN
NaN
Code i have been using:
df['col1'] = df['col1'].astype(str).str.extract(r'([A-Za-z]+(?: [A-Za-z]+)*)')
Later i am using this column to check the condition for NaN
but its not giving as after executing the above script it changes the NaN
values to 'nan'
.
Note: without casting to string by .astype(str)
, my data will get
AttributeError: Can only use .str accessor with string values!
Upvotes: 1
Views: 8164
Reputation: 23217
You can do it by the following steps:
NaN
value by empty string (which we may also get after removing characters and will be converted back to NaN
afterwards)..astype(str)
for in case some elements are non-strings in the column.str.replace()
with regexNaN
by .replace()
(Note: The first 2 steps are to special handle for OP's problem of getting AttributeError: Can only use .str accessor with string values!
although my testing of specially adding integer and float (not integer and float in string but real numeric values) also got no problem without the first 2 steps. Maybe some other special data types!?) Other users without the same problem can use only the last 2 steps starting with str.replace()
.
df['col1'] = df['col1'].fillna('').astype(str).str.replace(r'[^A-Za-z ]', '', regex=True).replace('', np.nan, regex=False)
Result:
print(df)
col1
0 ABC ad
1 YQW
2 AQ GH
3 NaN
4 NaN
5 NaN
Note that we cannot use .extract()
here and have to use .replace()
to get rid of the unwanted characters. How about a string like ' ab c1d2@ ef4' ? What regex pattern to use to extract only the alphabets and spaces leaving behind the numbers and special characters ? And don't forget we have to consider the generic cases, not just the sample data here. Can we quote all possible patterns of regex here to handle the infinite numbers of combinations of such alpha, space, number and special character patterns ?
Upvotes: 0
Reputation: 26676
Another way is to extract alphanumerics but exclude numerals. See code below
df['col1']=df['col1'].str.extract('(\w+\s\w+[^0-9]|\w+[^0-9])')
col1
0 ABC ad
1 YQW
2 AQ4 GH
3 NaN
4 NaN
5 NaN
Upvotes: 1