Reputation: 47
I am stuck on how to write a function to calculate the value of one column based on another column.
For example, my dataframe looks like this:
    Name  Total      Bonus
0    Amy  15000        10k
1    Bob  14000        10%
2  Cathy  13400       5.5k
3  David  14800  10% - 20%
4   Emma  15200         8%
5    Fay  13800     0 - 5%
6   Gina  14500   5k - 10k
...
Code for recreating it:
import pandas as pd

df = pd.DataFrame()
df["Name"] = ["Amy", "Bob", "Cathy", "David", "Emma", "Fay", "Gina"]
df["Total"] = ["15000", "14000", "13400", "14800", "15200", "13800", "14500"]
df["Bonus"] = ["10k", "10%", "5.5k", "10% - 20%", "8%", "0 - 5%", "5k - 10k"]
The dtype of every column in the dataframe is object.
What I want is to convert the Bonus column to plain numbers: "k" means thousands, "%" means a percentage of Total, and for a range the lower value is used.
The result df should look like this:
    Name  Total  Bonus
0    Amy  15000  10000
1    Bob  14000   1400
2  Cathy  13400   5500
3  David  14800   1480
4   Emma  15200   1216
5    Fay  13800      0
6   Gina  14500   5000
...
I am struggling with the function. I was able to write one that converts the number strings to int and does the calculations, but I couldn't get the results back into the dataframe.
Here is my code:
def convertToNumber(df):
    for i in df.index:
        bonus = df.Bonus[i]
        if bonus == "0":
            bonus = 0
        elif bonus.endswith('k'):
            kb = bonus.split("k")
            #print(kb)
            bonus = int(float(kb[0]) * 1000)
        elif bonus.endswith("%"):
            total = int(df.Total[i])
            if len(bonus) > 2:
                b = int(bonus[: 2])
            else:
                b = int(bonus[0])
            bonus = total * b // 100
        else:
            bonus = -1
        print(bonus)

convertToNumber(df)
df
And my result is:
10000
1400
5500
1480
1216
0
5000
    Name  Total      Bonus
0    Amy  15000        10k
1    Bob  14000        10%
2  Cathy  13400       5.5k
3  David  14800  10% - 20%
4   Emma  15200         8%
5    Fay  13800     0 - 5%
6   Gina  14500   5k - 10k
Upvotes: 4
Views: 2208
Reputation: 3369
My solution is based on regular expressions:
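import re
import pandas as pd
import numpy as np

# Leading number, then anything, then an optional trailing unit ('%' or 'k').
# Raw string so that \d is not treated as a string escape.
p = re.compile(r"^([\d.]+).*?([%k]?)$")

def calc(v, total):
    this_match = p.findall(v)
    if len(this_match) > 0:
        try:
            this_match = this_match[0]
            if this_match[1] == "k":
                # e.g. "5.5k" -> 5500
                return int(float(this_match[0]) * 1000)
            elif this_match[1] == "%":
                # e.g. "10%" of a total of 14000 -> 1400
                return int(int(total) * float(this_match[0]) / 100)
            else:
                return int(this_match[0])
        except ValueError:
            print(f'Problems in parsing numeric patterns in {v}')
    else:
        print(f'Cannot match pattern {v}')
    # Reached only when the value could not be parsed
    return np.nan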
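The function above still has to be applied to the frame. One way to do that, sketched here under the assumption that df is the frame from the question, is a row-wise apply. The pattern and calc are repeated in a slightly condensed form only so the snippet runs on its own.

```python
import re

import numpy as np
import pandas as pd

# Same pattern and logic as calc above, repeated so this snippet is standalone
p = re.compile(r"^([\d.]+).*?([%k]?)$")

def calc(v, total):
    matches = p.findall(v)
    if matches:
        number, unit = matches[0]
        try:
            if unit == "k":
                return int(float(number) * 1000)
            if unit == "%":
                return int(int(total) * float(number) / 100)
            return int(number)
        except ValueError:
            print(f"Problems in parsing numeric patterns in {v}")
    else:
        print(f"Cannot match pattern {v}")
    return np.nan

# The frame from the question (all columns hold strings)
df = pd.DataFrame({
    "Name": ["Amy", "Bob", "Cathy", "David", "Emma", "Fay", "Gina"],
    "Total": ["15000", "14000", "13400", "14800", "15200", "13800", "14500"],
    "Bonus": ["10k", "10%", "5.5k", "10% - 20%", "8%", "0 - 5%", "5k - 10k"],
})

# axis=1 hands calc one row at a time, so both Bonus and Total are available
df["Bonus"] = df.apply(lambda row: calc(row["Bonus"], row["Total"]), axis=1)
print(df)
```

Assigning the returned Series back to df["Bonus"] is the step the original convertToNumber was missing: it only printed each value.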
Don't get me wrong: the vectorized solution in the other answer is usually preferable, especially if you have to work with large frames.
For smaller frames, the re solution is faster, so I think it really depends on what you have to do.
Upvotes: 0
Reputation: 28322
First, handle the ranges and select the lower value, then create two boolean masks for k and % separately, and then apply all the related logic. For example:
# Handle ranges: keep only the lower value
df['Bonus'] = df['Bonus'].str.split('-').str[0].str.strip()
# Total is stored as strings, so convert it before doing arithmetic
df['Total'] = df['Total'].astype(int)
# Create boolean masks
ks = df['Bonus'].str.endswith('k')
ps = df['Bonus'].str.endswith('%')
# Remove 'k' and '%' and convert the type to float
df.loc[ks | ps, 'Bonus'] = df.loc[ks | ps, 'Bonus'].str[:-1]
df['Bonus'] = df['Bonus'].astype(float)
# Apply the mask logic and convert to int
df.loc[ks, 'Bonus'] = df.loc[ks, 'Bonus'] * 1000
df.loc[ps, 'Bonus'] = df.loc[ps, 'Total'] * df.loc[ps, 'Bonus'] / 100
df['Bonus'] = df['Bonus'].astype(int)
Result:
    Name  Total  Bonus
0    Amy  15000  10000
1    Bob  14000   1400
2  Cathy  13400   5500
3  David  14800   1480
4   Emma  15200   1216
5    Fay  13800      0
6   Gina  14500   5000
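If you would rather not overwrite the columns step by step, the same idea can be wrapped in a function that returns a new Series. This is only a sketch: bonus_to_number is a hypothetical helper name, and it rounds before casting to int, which is a small deviation from the plain astype(int) used above.

```python
import pandas as pd

def bonus_to_number(bonus: pd.Series, total: pd.Series) -> pd.Series:
    """Hypothetical helper: numeric bonuses, leaving the inputs untouched."""
    # Keep only the lower bound of ranges such as "10% - 20%"
    low = bonus.str.split("-").str[0].str.strip()
    # Boolean masks for the two kinds of suffix
    is_k = low.str.endswith("k")
    is_pct = low.str.endswith("%")
    # Strip the trailing unit and parse what is left as a float
    value = low.str.rstrip("k%").astype(float)
    total = total.astype(float)
    # Start from the plain number, then overwrite the 'k' rows and the '%' rows
    result = value.mask(is_k, value * 1000).mask(is_pct, total * value / 100)
    # Assumption: round first so float noise cannot truncate a value downwards
    return result.round().astype(int)

# The frame from the question (all columns hold strings)
df = pd.DataFrame({
    "Name": ["Amy", "Bob", "Cathy", "David", "Emma", "Fay", "Gina"],
    "Total": ["15000", "14000", "13400", "14800", "15200", "13800", "14500"],
    "Bonus": ["10k", "10%", "5.5k", "10% - 20%", "8%", "0 - 5%", "5k - 10k"],
})

converted = bonus_to_number(df["Bonus"], df["Total"])
print(df.assign(Bonus=converted))
```

Because the function does not modify df, the original strings stay available for checking the conversion.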
Upvotes: 6