Reputation: 3
Example of Dataframe
My pandas DataFrame has a column, EvaRange, whose values are recorded in the following form.
<1000 mm
1000-1200mm
1200-1400mm
>1400mm
Desired Output
I want to perform some machine learning on the dataframe, so I need to convert each range into a single numerical value.
So far I have managed to do this for a single row in the dataframe but I want to apply it to the entire column.
Code Example
a = df["EvaRange"][0].strip().split('mm')[0].split('-')
b = (float(a[0])+float(a[1]))/2
b
This returns the average of the two bounds when the range contains two numbers.
Request
Could someone please help me generalize this so that I can apply it to the entire column and accommodate the "<" and ">" values?
Upvotes: 0
Views: 948
Reputation: 28679
I would suggest using str.extractall to get all the numbers, then get the mean on the first level:
df.EvaRange.str.extractall(r"(\d+)").astype(float).groupby(level=0).mean()
0
0 1000.0
1 1100.0
2 1300.0
3 1400.0
Building on your idea of strip and split:
(df.EvaRange
.str.strip("<> mm")
.str.split("-")
.explode()
.astype(float)
.groupby(level=0).mean()
)
0 1000.0
1 1100.0
2 1300.0
3 1400.0
Name: EvaRange, dtype: float64
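For reference, the same strip-and-split steps can be sketched for a single value in plain Python. This is only an illustration: the helper name range_midpoint is hypothetical, and it assumes every label has the form shown in the question (an optional "<" or ">", one or two numbers separated by "-", and a trailing "mm").

```python
def range_midpoint(text):
    """Return the mean of the numbers in a label such as '1000-1200mm'."""
    # Strip the comparison signs, spaces, and the 'mm' unit from both ends.
    cleaned = text.strip("<> m")
    # Split on the dash; open-ended labels produce a single part.
    parts = cleaned.split("-")
    numbers = [float(part) for part in parts]
    return sum(numbers) / len(numbers)


# Sample labels taken from the question.
labels = ["<1000 mm", "1000-1200mm", "1200-1400mm", ">1400mm"]
midpoints = [range_midpoint(label) for label in labels]
print(midpoints)
```

If the vectorized string methods are not a requirement, a helper like this could presumably be applied to the whole column with df["EvaRange"].apply(range_midpoint), although the .str chain above is likely to be faster on large frames.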
Upvotes: 0
Reputation: 57033
I would recommend extracting numbers and then averaging them:
df["EvaRange"].str.extract(r"(\d+)\D*(\d+)?").astype(float).mean(axis=1)
#0 1000.0
#1 1100.0
#2 1300.0
#3 1400.0
Here, the regular expression r"(\d+)\D*(\d+)?" asks for one or more digits (a number), optionally followed by some non-digits, optionally followed by some more digits (another number).
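To see what each capture group picks up, here is a small sketch using Python's standard re module on the sample labels from the question. The helper name extract_numbers is hypothetical and not part of pandas; it only mirrors what the pattern does on one string at a time.

```python
import re

# Same pattern as above: a number, optional non-digits, an optional second number.
pattern = re.compile(r"(\d+)\D*(\d+)?")


def extract_numbers(text):
    """Return the numbers captured by the pattern as a list of floats."""
    match = pattern.search(text)
    if match is None:
        # No digits at all in the label.
        return []
    # An unmatched optional group is returned as None and is dropped here.
    return [float(group) for group in match.groups() if group is not None]


# Sample labels taken from the question.
labels = ["<1000 mm", "1000-1200mm", "1200-1400mm", ">1400mm"]
for label in labels:
    numbers = extract_numbers(label)
    print(label, numbers, sum(numbers) / len(numbers))
```

In the pandas version, the unmatched second group becomes NaN instead of None, and mean(axis=1) skips NaN by default, which is why the open-ended labels come out as their single number.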
Upvotes: 2