Reputation: 7448
I am wondering how to use a regex to remove all non-numeric characters from a series, and then keep only the values that are neither empty nor made up of spaces (a single value may contain one or more spaces). Is there a more efficient way than the following?
df['numeric_no'] = df['id'].apply(lambda x: re.sub("[^0-9]", "", x))
df = df[(df['numeric_no'] != '') & (df['numeric_no'] != ' ')]
Some sample data for the df:
numeric_no
B-27000
44-11-E
LAND-11-4
17772A
88LL9A
321LP-3
UNIT 9 CAM -00-12
WWcard_055_34QE
EE119.45
aaa
b b
The result should look like:
numeric_no
27000
4411
114
17772
889
3213
90012
05534
119.45
Upvotes: 0
Views: 62
Reputation: 862481
I believe you need str.findall with boolean indexing:
s = df['numeric_no'].str.findall(r"(\d*\.\d+|\d+)").str.join('')
s = s[s.astype(bool)]
print(s)
0 27000
1 4411
2 114
3 17772
4 889
5 3213
6 90012
7 05534
8 119.45
Name: numeric_no, dtype: object
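For anyone who wants to run this end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt from the sample values in the question. As an assumption on my part, the filter is written as an explicit comparison with the empty string rather than astype(bool), which should give the same rows for plain string values.

```python
import pandas as pd

# Sample values copied from the question (the last two contain no digits)
df = pd.DataFrame({"numeric_no": [
    "B-27000", "44-11-E", "LAND-11-4", "17772A", "88LL9A", "321LP-3",
    "UNIT 9 CAM -00-12", "WWcard_055_34QE", "EE119.45", "aaa", "b b",
]})

# Find every run of digits (optionally with a decimal part) in each value,
# then glue the pieces of each row back together into one string
s = df["numeric_no"].str.findall(r"\d*\.\d+|\d+").str.join("")

# Rows without any digits end up as empty strings; drop them
s = s[s != ""]

result = s.tolist()
print(result)
```

Both steps are vectorized string methods, so no Python-level lambda is applied per row.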
Upvotes: 1
Reputation: 153460
I think you can try:
df.numeric_no.str.extractall(r'(\d+?[\.\d+])').astype(str).groupby(level=0).sum()
Output:
0
0 2700
1 4411
2 11
3 1777
4 88
5 32
6 0012
7 0534
8 119.45
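Note that this output is missing some digits compared with the requested result (2700 rather than 27000, 32 rather than 3213). The pattern matches a digit followed by exactly one character from the class [\.\d+], so a leftover single digit is skipped. A possible variant is sketched below. It assumes the alternation pattern \d*\.\d+|\d+ instead, and the names matches and joined are only illustrative.

```python
import pandas as pd

# Sample values copied from the question
df = pd.DataFrame({"numeric_no": [
    "B-27000", "44-11-E", "LAND-11-4", "17772A", "88LL9A", "321LP-3",
    "UNIT 9 CAM -00-12", "WWcard_055_34QE", "EE119.45", "aaa", "b b",
]})

# extractall returns one row per match, indexed by
# (original row label, match number), with the captured group in column 0
matches = df["numeric_no"].str.extractall(r"(\d*\.\d+|\d+)")

# Concatenate the matches that belong to the same original row.
# Rows with no match at all ('aaa', 'b b') never appear in `matches`,
# so they drop out without a separate filtering step.
joined = matches[0].groupby(level=0).agg("".join)

print(joined.tolist())
```

With extractall the non-matching rows disappear on their own, which is why no boolean mask is needed here.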
Upvotes: 1