Michail N

Reputation: 3835

Apply a regex function on a pandas dataframe

I have a dataframe in pandas like:

0                       1                   2
([0.8898668778942382    0.89533945283595]   0)
([1.2632564814188714    1.0207660696232244] 0)
([1.006649166957976     1.1180973832359227] 0)
([0.9653632916751714    0.8625538463644129] 0)
([1.038366333873932     0.9091449796555554] 0)

All values are strings. I want to remove all special characters and convert the values to double. I want to apply a function that removes every special character except the dot, like

import re
re.sub('[^0-9.]+', '',x)

So I want to apply this to every cell of the dataframe. How can I do it? I found the df.applymap function, but I don't know how to pass the string as an argument. I tried

def remSp(x): 
    re.sub('^[0-9]+', '',x)

df.applymap(remSp())

but I don't know how to pass the cells to the function. Is there a better way to do it?

Thank you

Upvotes: 5

Views: 10017

Answers (3)

Bharath M Shetty

Reputation: 30605

Why not use the default replace method on the dataframe directly with regex=True, i.e.

df = df.replace(r'[^\d.]', '', regex=True).astype(float)
          0         1    2
0  0.889867  0.895339  0.0
1  1.263256  1.020766  0.0
2  1.006649  1.118097  0.0
3  0.965363  0.862554  0.0
4  1.038366  0.909145  0.0

This is still faster than the approaches in the other answers.
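As a minimal, self-contained sketch of the same idea, the snippet below rebuilds the first two rows of the frame from the question (the column labels 0, 1, 2 and the name `cleaned` are assumptions for illustration) and applies `DataFrame.replace` with `regex=True`:

```python
import pandas as pd

# First two rows of the sample frame from the question; every cell is a string.
df = pd.DataFrame({
    0: ['([0.8898668778942382', '([1.2632564814188714'],
    1: ['0.89533945283595]', '1.0207660696232244]'],
    2: ['0)', '0)'],
})

# Remove every character that is not a digit or a dot, then cast to float.
cleaned = df.replace(r'[^\d.]', '', regex=True).astype(float)
print(cleaned)
```

Note that the character class keeps only digits and dots, so a leading minus sign or an exponent such as `1e-5` would be stripped as well; the pattern would need extending if the data could contain those.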

Upvotes: 6

cs95

Reputation: 402263

Iterate over columns, call str.replace.

for c in df.columns:
    # regex=True is required on pandas >= 2.0, where str.replace is literal by default
    df[c] = df[c].str.replace(r'[^\d.]', '', regex=True)

df = df.astype(float)
df
          0         1  2
0  0.889867  0.895339  0
1  1.263256  1.020766  0
2  1.006649  1.118097  0
3  0.965363  0.862554  0
4  1.038366  0.909145  0

Unfortunately, pandas does not yet support string accessor operations on the dataframe as a whole, so the alternative to looping over columns would be something slower like a lambdised applymap/transform.
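The column loop can also be wrapped in a small helper. This is only a sketch: the function name `clean_numeric_strings` is hypothetical, and it assumes every column holds strings. Since pandas 2.0, `Series.str.replace` treats the pattern literally unless `regex=True` is passed, so the flag is spelled out.

```python
import pandas as pd

def clean_numeric_strings(frame):
    """Hypothetical helper: strip non-numeric characters column by column."""
    out = frame.copy()  # leave the caller's frame untouched
    for c in out.columns:
        # Keep only digits and dots; regex=True makes the pattern a regex.
        out[c] = out[c].str.replace(r'[^\d.]', '', regex=True)
    return out.astype(float)

# First two rows of the sample frame from the question.
df = pd.DataFrame({
    0: ['([0.8898668778942382', '([1.2632564814188714'],
    1: ['0.89533945283595]', '1.0207660696232244]'],
    2: ['0)', '0)'],
})
result = clean_numeric_strings(df)
print(result)
```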


Performance

Small

100 loops, best of 3: 2.04 ms per loop  # applymap 
100 loops, best of 3: 2.69 ms per loop  # transform
1000 loops, best of 3: 1.45 ms per loop  # looped str.replace

Large (df * 10000)

1 loop, best of 3: 618 ms per loop  # applymap 
1 loop, best of 3: 658 ms per loop  # transform
1 loop, best of 3: 341 ms per loop  # looped str.replace
1 loop, best of 3: 212 ms per loop  # df.replace
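A rough way to rerun such a comparison is sketched below with the standard `timeit` module. The construction of the large frame is an assumption (rows repeated with `pd.concat`), the function names are hypothetical, and the numbers will differ by machine and pandas version.

```python
import timeit
import pandas as pd

base = pd.DataFrame({
    0: ['([0.8898668778942382', '([1.2632564814188714'],
    1: ['0.89533945283595]', '1.0207660696232244]'],
    2: ['0)', '0)'],
})
# Assumption: model the "large" case by repeating the rows.
big = pd.concat([base] * 1000, ignore_index=True)

def with_df_replace(frame):
    # Whole-frame regex replace.
    return frame.replace(r'[^\d.]', '', regex=True).astype(float)

def with_column_loop(frame):
    # Per-column str.replace on a copy of the frame.
    out = frame.copy()
    for c in out.columns:
        out[c] = out[c].str.replace(r'[^\d.]', '', regex=True)
    return out.astype(float)

for fn in (with_df_replace, with_column_loop):
    # Best of three single runs, reported in milliseconds.
    seconds = min(timeit.repeat(lambda: fn(big), number=1, repeat=3))
    print(f'{fn.__name__}: {seconds * 1e3:.1f} ms')
```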

Upvotes: 2

Zero

Reputation: 76917

Using applymap

In [814]: df.applymap(lambda x: re.sub(r'[^\d.]+', '', x)).astype(float)
Out[814]:
          0         1    2
0  0.889867  0.895339  0.0
1  1.263256  1.020766  0.0
2  1.006649  1.118097  0.0
3  0.965363  0.862554  0.0
4  1.038366  0.909145  0.0

Using transform

In [809]: df.transform(lambda x: x.str.replace(r'[^\d.]+', '', regex=True)).astype(float)
Out[809]:
          0         1    2
0  0.889867  0.895339  0.0
1  1.263256  1.020766  0.0
2  1.006649  1.118097  0.0
3  0.965363  0.862554  0.0
4  1.038366  0.909145  0.0
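The attempt in the question fails for two reasons: `remSp` never returns the result of `re.sub`, and `df.applymap(remSp())` calls the function with no argument instead of passing the function itself. A corrected sketch follows; the name `rem_sp` is illustrative, and the fallback between `DataFrame.map` and `applymap` is an assumption made to cover both newer pandas releases, where `applymap` is deprecated in favour of `map`, and older ones.

```python
import re
import pandas as pd

# First two rows of the sample frame from the question.
df = pd.DataFrame({
    0: ['([0.8898668778942382', '([1.2632564814188714'],
    1: ['0.89533945283595]', '1.0207660696232244]'],
    2: ['0)', '0)'],
})

def rem_sp(x):
    # Return the cleaned string; without `return` every cell would become None.
    return re.sub(r'[^0-9.]+', '', x)

# Use DataFrame.map where it exists, otherwise fall back to applymap.
elementwise = df.map if hasattr(df, 'map') else df.applymap

# Pass the function object (rem_sp), not the result of calling it (rem_sp()).
result = elementwise(rem_sp).astype(float)
print(result)
```

pandas calls `rem_sp` once per cell and supplies the cell value as `x`, so no argument needs to be passed by hand.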

Upvotes: 3
