dorothyZhang

Reputation: 47

How to calculate the value of one column based on another column?

I am stuck on how to write a function to calculate the value of one column based on another column.

For example, my dataframe looks like this:

  Name  Total   Bonus
0 Amy   15000   10k
1 Bob   14000   10%
2 Cathy 13400   5.5k
3 David 14800   10% - 20%
4 Emma  15200   8%
5 Fay   13800   0 - 5%
6 Gina  14500   5k - 10k
...

Code for recreating it:

import pandas as pd

df = pd.DataFrame()
df["Name"] = ["Amy", "Bob", "Cathy", "David", "Emma", "Fay", "Gina"]
df["Total"] = ["15000", "14000", "13400", "14800","15200","13800","14500"]
df["Bonus"] = ["10k", "10%", "5.5k", "10% - 20%", "8%", "0 - 5%", "5k - 10k"]

The dtype of every column in the dataframe is object (the values are strings).

What I want is to convert the Bonus column into plain numbers: "k" means thousands, "%" means a percentage of Total, and for a range the lower value is used.

The result df should look like this:

  Name  Total   Bonus
0 Amy   15000   10000
1 Bob   14000   1400
2 Cathy 13400   5500
3 David 14800   1480
4 Emma  15200   1216
5 Fay   13800   0
6 Gina  14500   5000
...

I am struggling with the function. I was able to write one that converts the string numbers to int and does the calculation, but I couldn't get the results back into the dataframe.

Here is my code:

def convertToNumber(df):
    for i in df.index:
        bonus = df.Bonus[i]
        if bonus == "0":
            bonus = 0
        elif bonus.endswith('k'):
            kb = bonus.split("k")
            #print(kb)
            bonus = int(float(kb[0]) * 1000)
        elif bonus.endswith("%"):
            total = int(df.Total[i])
            if len(bonus) > 2:
                b = int(bonus[: 2])
            else:
                b = int(bonus[0])
            bonus = total * b // 100
        else:
            bonus = -1
        print(bonus)

convertToNumber(df)
df

And my result is:

10000
1400
5500
1480
1216
0
5000

   Name     Total   Bonus
0   Amy     15000   10k
1   Bob     14000   10%
2   Cathy   13400   5.5k
3   David   14800   10% - 20%
4   Emma    15200   8%
5   Fay     13800   0 - 5%
6   Gina    14500   5k - 10k

Upvotes: 4

Views: 2208

Answers (2)

nikeros

Reputation: 3369

Here is a solution based on regular expressions:

import re
import pandas as pd
import numpy as np

# Raw string so that \d is not treated as a string escape; [%k] matches the optional unit
p = re.compile(r"^([\d.]+).*?([%k]?)$")

def calc(v, total):
    this_match = p.findall(v)       
    if len(this_match) > 0:
        try:
            this_match = this_match[0]
            if this_match[1] == "k":
                return int(float(this_match[0]) * 1000)
            elif this_match[1] == "%":
                return int(int(total) * float(this_match[0]) / 100)
            else:
                return int(this_match[0])
        except ValueError:
            print(f'Problems in parsing numeric patterns in {v}')
    else:
        print(f'Cannot match pattern {v}')
    return np.nan  

Don't get me wrong: the vectorized solution in the other answer is generally preferable, especially if you have to work with large frames. For smaller frames the re solution can be faster, so I think it really depends on what you have to do.
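
As a usage sketch, the function could be applied row by row and the result assigned back to the column, which is the step the loop in the question was missing. The block below restates calc in a compact form so that it runs on its own; it leaves out the error messages and uses int(float(...)) for values without a unit, both of which are simplifications on my part rather than part of the function above.

```python
import re

import numpy as np
import pandas as pd

# Leading number, anything in between, optional trailing unit (% or k).
p = re.compile(r"^([\d.]+).*?([%k]?)$")

def calc(v, total):
    """Compact restatement of calc (error messages omitted for brevity)."""
    m = p.match(v)
    if m is None:
        return np.nan
    number, unit = m.groups()
    if unit == "k":
        return int(float(number) * 1000)
    if unit == "%":
        return int(int(total) * float(number) / 100)
    # Simplification: go through float so a unit-less decimal does not raise.
    return int(float(number))

# The frame from the question, with every column stored as strings.
df = pd.DataFrame({
    "Name": ["Amy", "Bob", "Cathy", "David", "Emma", "Fay", "Gina"],
    "Total": ["15000", "14000", "13400", "14800", "15200", "13800", "14500"],
    "Bonus": ["10k", "10%", "5.5k", "10% - 20%", "8%", "0 - 5%", "5k - 10k"],
})

# Compute one value per (Bonus, Total) pair and write the list back to the column.
df["Bonus"] = [calc(b, t) for b, t in zip(df["Bonus"], df["Total"])]
print(df)
```

Because the pattern anchors the number at the start of the string and the unit at the end, a range such as "10% - 20%" is evaluated with its lower value and the unit of its upper value, which works here since both ends of each range share a unit.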

Upvotes: 0

Shaido

Reputation: 28322

First, reduce each range to its lower value. Then create two boolean masks, one for the values ending in k and one for those ending in %, and apply the matching conversion to each. For example:

# Handle ranges
df['Bonus'] = df['Bonus'].str.split('-').str[0].str.strip()

# Create boolean masks
ks = df['Bonus'].str.endswith('k')
ps = df['Bonus'].str.endswith('%')

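# Remove 'k' and '%' and convert the type to float
df.loc[ks | ps, 'Bonus'] = df.loc[ks | ps, 'Bonus'].str[:-1]
df['Bonus'] = df['Bonus'].astype(float)

# Total holds strings in the question, so make it numeric before multiplying
df['Total'] = df['Total'].astype(int)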

# Apply the mask logic and convert to int
df.loc[ks, 'Bonus'] = df.loc[ks, 'Bonus'] * 1000
df.loc[ps, 'Bonus'] = df.loc[ps, 'Total'] * df.loc[ps, 'Bonus'] / 100
df['Bonus'] = df['Bonus'].astype(int)

Result:

    Name  Total  Bonus
0    Amy  15000  10000
1    Bob  14000   1400
2  Cathy  13400   5500
3  David  14800   1480
4   Emma  15200   1216
5    Fay  13800      0
6   Gina  14500   5000
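
If the original frame should stay untouched, the same steps could be wrapped in a function that works on a copy. This is only a sketch under the assumptions of the question (string columns, units limited to k and %); normalize_bonus is a hypothetical name chosen for illustration, and where/mask are used here in place of the .loc assignments above.

```python
import pandas as pd

def normalize_bonus(df):
    """Hypothetical helper: return a copy of df with Bonus as an integer amount."""
    out = df.copy()
    out["Total"] = out["Total"].astype(int)

    # Keep only the lower bound of ranges such as "10% - 20%".
    bonus = out["Bonus"].str.split("-").str[0].str.strip()

    # Boolean masks for the two units.
    ks = bonus.str.endswith("k")
    ps = bonus.str.endswith("%")

    # Drop the trailing unit where there is one, then convert to float.
    value = bonus.where(~(ks | ps), bonus.str[:-1]).astype(float)

    # Thousands for "k" rows, share of Total for "%" rows.
    value = value.mask(ks, value * 1000)
    value = value.mask(ps, out["Total"] * value / 100)

    out["Bonus"] = value.astype(int)
    return out

df = pd.DataFrame({
    "Name": ["Amy", "Bob", "Cathy", "David", "Emma", "Fay", "Gina"],
    "Total": ["15000", "14000", "13400", "14800", "15200", "13800", "14500"],
    "Bonus": ["10k", "10%", "5.5k", "10% - 20%", "8%", "0 - 5%", "5k - 10k"],
})

result = normalize_bonus(df)
print(result)
```

Returning a new frame keeps the raw strings available in df, which can be handy for checking the converted values against the input.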

Upvotes: 6
