Hessu
Hessu

Reputation: 49

How to extract an arithmetic operation from a string with Pandas

In a Pandas DataFrame

>> df.head()

    A                               B             C
0   1    â#0.00 + "s=?0.07 + 'due0.93    rt@-[ 3.01
1   2    â#0.02 + "s=?0.16 + 'due0.82    rt@-[ 2.97
...

I would like to extract only the numeric values. Column C I can do with, e.g.,

>> extr = df['C'].str.extract(r'(\d+\.\d+)', expand=False)
>> df['C'] = pd.to_numeric(extr)
>> df.head()

    A                               B       C
0   1    â#0.00 + "s=?0.07 + 'due0.93    3.01
1   2    â#0.02 + "s=?0.16 + 'due0.82    2.97
...

but I have problems with the B column. How can I extract the + operations, as well as the floats? I tried

>> extr = df['B'].str.extract(r'(\d+\.\d+)\+(\d+\.\d+)\+(\d+\.\d+)', expand=False)

which I was hoping would give me something like

                    0
0    '0.00+0.07+0.93'
1    '0.02+0.16+0.82'
...

but instead it gives me three columns with NaN values in them:

       0      1      2
0    NaN    NaN    NaN
1    NaN    NaN    NaN
...

So how could I extract the whole arithmetic operations?

(Only the + operations are needed, and any other characters, such as -, can be ignored.)

Upvotes: 4

Views: 245

Answers (4)

Shubham Sharma
Shubham Sharma

Reputation: 71689

An alternate approach using Series.str.findall:

df['B'] = df['B'].str.findall(r'(\d+(?:.\d+)?)').agg('+'.join)

# print(df)
   A               B     C
0  1  0.00+0.07+0.93  3.01
1  2  0.02+0.16+0.82  2.97

timeit comparision of all the solutions:

df.shape
(20000, 4)

%%timeit -n100 @Shubham solution
df['B'].str.findall(r'(\d+(?:.\d+)?)').agg('+'.join)
31.9 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


%%timeit -n100 @Rakesh solution
df["B"].str.findall(r"(\d+\.\d+)").str.join("+")
32.7 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


%%timeit -n100 @Sammy solution
["+".join(re.findall("(\d+\.?\d+)",entry)) for entry in df.B]
36.8 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


%%timeit -n100 @JudV solution
df['B'].str.replace(r'[^\d.+]', '')
59.7 ms ± 5.81 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Upvotes: 2

JvdV
JvdV

Reputation: 75890

Python is not my forte but I'd use replace instead and do the operation for both columns, maybe look into:

df[['B', 'C']] = df[['B','C']].replace(r'[^\d.+]', '', regex=True)
print(df)

Result:

   A               B     C
0  1  0.00+0.07+0.93  3.01
1  2  0.02+0.16+0.82  2.97

If it's just column B you are after than maybe simply use:

extr = df['B'].str.replace(r'[^\d.+]', '')

Upvotes: 1

Rakesh
Rakesh

Reputation: 82775

This is one approach using str.findall & .str.join("+")

Ex:

df = pd.DataFrame({"B": ["""â#0.00 + "s=?0.07 + 'due0.93""", """â#0.02 + "s=?0.16 + 'due0.82"""]})
df["Z"] = df["B"].str.findall(r"(\d+\.\d+)").str.join("+")

print(df)

Output:

                              B               Z
0  â#0.00 + "s=?0.07 + 'due0.93  0.00+0.07+0.93
1  â#0.02 + "s=?0.16 + 'due0.82  0.02+0.16+0.82

Upvotes: 1

sammywemmy
sammywemmy

Reputation: 28699

One way is to run a str join on the extracted data, using + as the delimiter

import re

df = pd.read_clipboard(sep='\s{2,}')


df['extract'] = ["+".join(re.findall("(\d+\.?\d+)",entry)) for entry in df.B]


    A                 B                  C         extract
0   1   â#0.00 + "s=?0.07 + 'due0.93    3.01    0.00+0.07+0.93
1   2   â#0.02 + "s=?0.16 + 'due0.82    2.97    0.02+0.16+0.82

Upvotes: 1

Related Questions