Reputation: 49
In a Pandas DataFrame
>> df.head()
A B C
0 1 â#0.00 + "s=?0.07 + 'due0.93 rt@-[ 3.01
1 2 â#0.02 + "s=?0.16 + 'due0.82 rt@-[ 2.97
...
I would like to extract only the numeric values. Column C
I can do with, e.g.,
>> extr = df['C'].str.extract(r'(\d+\.\d+)', expand=False)
>> df['C'] = pd.to_numeric(extr)
>> df.head()
A B C
0 1 â#0.00 + "s=?0.07 + 'due0.93 3.01
1 2 â#0.02 + "s=?0.16 + 'due0.82 2.97
...
but I have problems with the B
column. How can I extract the +
operations, as well as the floats? I tried
>> extr = df['B'].str.extract(r'(\d+\.\d+)\+(\d+\.\d+)\+(\d+\.\d+)', expand=False)
which I was hoping would give me something like
0
0 '0.00+0.07+0.93'
1 '0.02+0.16+0.82'
...
but instead it gives me three columns with NaN
values in them:
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
...
So how could I extract the whole arithmetic operations?
(Only the +
operations are needed, and any other characters, such as -
, can be ignored.)
Upvotes: 4
Views: 245
Reputation: 71689
An alternate approach using Series.str.findall
:
df['B'] = df['B'].str.findall(r'(\d+(?:.\d+)?)').agg('+'.join)
# print(df)
A B C
0 1 0.00+0.07+0.93 3.01
1 2 0.02+0.16+0.82 2.97
timeit
comparision of all the solutions:
df.shape
(20000, 4)
%%timeit -n100 @Shubham solution
df['B'].str.findall(r'(\d+(?:.\d+)?)').agg('+'.join)
31.9 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n100 @Rakesh solution
df["B"].str.findall(r"(\d+\.\d+)").str.join("+")
32.7 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n100 @Sammy solution
["+".join(re.findall("(\d+\.?\d+)",entry)) for entry in df.B]
36.8 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n100 @JudV solution
df['B'].str.replace(r'[^\d.+]', '')
59.7 ms ± 5.81 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Upvotes: 2
Reputation: 75890
Python
is not my forte but I'd use replace
instead and do the operation for both columns, maybe look into:
df[['B', 'C']] = df[['B','C']].replace(r'[^\d.+]', '', regex=True)
print(df)
Result:
A B C
0 1 0.00+0.07+0.93 3.01
1 2 0.02+0.16+0.82 2.97
If it's just column B
you are after than maybe simply use:
extr = df['B'].str.replace(r'[^\d.+]', '')
Upvotes: 1
Reputation: 82775
This is one approach using str.findall
& .str.join("+")
Ex:
df = pd.DataFrame({"B": ["""â#0.00 + "s=?0.07 + 'due0.93""", """â#0.02 + "s=?0.16 + 'due0.82"""]})
df["Z"] = df["B"].str.findall(r"(\d+\.\d+)").str.join("+")
print(df)
Output:
B Z
0 â#0.00 + "s=?0.07 + 'due0.93 0.00+0.07+0.93
1 â#0.02 + "s=?0.16 + 'due0.82 0.02+0.16+0.82
Upvotes: 1
Reputation: 28699
One way is to run a str join on the extracted data, using +
as the delimiter
import re
df = pd.read_clipboard(sep='\s{2,}')
df['extract'] = ["+".join(re.findall("(\d+\.?\d+)",entry)) for entry in df.B]
A B C extract
0 1 â#0.00 + "s=?0.07 + 'due0.93 3.01 0.00+0.07+0.93
1 2 â#0.02 + "s=?0.16 + 'due0.82 2.97 0.02+0.16+0.82
Upvotes: 1