Reputation: 53
I have a dataframe with a column of strings that are a mix of whole numbers and mixed fractions. I would like to convert column 'y' to floats.
x y z
0 4 Info
1 8 1/2 Info
2 3/4 Info
3 10 Info
4 4 Info
5 6 1/4 Info
The logic I am considering is to split column 'y' by ' ' and '/' to create three separate columns that would look like this.
x base b c z
0 4 0 0 Info
1 8 1 2 Info
2 0 3 4 Info
3 10 0 0 Info
4 4 0 0 Info
5 6 1 4 Info
From here I could
def convertReplace(df):
convert = lambda x: float(x)
df['base'].apply(convert)
df['b'].apply(convert)
df['c'].apply(convert)
decimal = lambda x,y: x/y
try:
df['d'] = decimal(df['b'],df['c'])
df['y'] = df['base'] + df['d']
except:
df['y'] = df['base']
return df
This might work but I can't get the column to split using the method found here.
df = pd.DataFrame(df.y.str.split(' ',1).str.split('/',1).tolist(),columns = ['base','b','c'])
The error says it expects 3 arguments each time when it may be 1, 2, or 3. Even this thread doesn't use multiple separators.
The actual dataframe has over 400k rows. Efficiency would be great but I'm more interested in just getting it done. Is this logic correct or is there a more concise way to do this? Any help is appreciated.
Upvotes: 2
Views: 953
Reputation: 7179
You could try the fractions module. Here's a one-liner:
import fractions
df['y_float'] = df['y'].apply(lambda frac: float(sum([fractions.Fraction(x) for x in frac.split()])))
This gives:
y z y_float
0 4 Info 4.00
1 8 1/2 Info 8.50
2 3/4 Info 0.75
3 10 Info 10.00
4 4 Info 4.00
5 6 1/4 Info 6.25
[EDIT] Corrected version accounting for negative fractions, as well as invalid text:
I realized the above approach would not work for negative fractions, so here is that taken in to account. As it turns out, a one-liner for this would be very tricky!
def get_sign(num_str):
"""
Verify the sign of the fraction
"""
return 1-2*num_str.startswith('-')
def is_valid_fraction(text_str):
"""
Check if the string provided is a valid fraction.
Here I just used a quick example to check for something of the form of the fraction you have. For something more robust based on what your data can potentially contain, a regex approach would be better.
"""
return text_str.replace(' ', '').replace('-', '').replace('/', '').isdigit()
def convert_to_float(text_str):
"""
Convert an incoming string to a float if it is a fraction
"""
if is_valid_fraction(text_str):
sgn = get_sign(text_str)
return sgn*float(sum([abs(fractions.Fraction(x)) for x in text_str.split()]))
else:
return pd.np.nan # Insert a NaN if it is invalid text
So now you will have this:
>>> df['y_float'] = df['y'].apply(lambda frac: convert_to_float(frac))
>>> df
y z y_float
0 4 Info 4.00
1 8 1/2 Info 8.50
2 3/4 Info 0.75
3 10 Info 10.00
4 0 Info 0.00
5 6 1/4 Info 6.25
6 -3 2/5 Info -3.40
7 -4/5 Info -0.80
8 gibberish100 Info NaN
Upvotes: 2