bluetooth
bluetooth

Reputation: 559

Pandas - Sorting by 2 columns and comparing the values in the other column

Please consider this data Frame:

pd.DataFrame({
'REGION':['US','US','CAN','CAN', 'EU','EU','EU'],
'ROLE': ['mgr','dir','mgr','dir','mgr','dir','CEO'],
'SALARY' : [4,5,3.7,6,4.1,5.5,8],
'other_columns':['random_val1','random_val2','random_val3','random_val4','random_val5','random_val6','random_val7']

})

In this data frame, we have two regions, and in each region multiple employee roles. The salary column contains salary for that role in that region. Assume that all salary numbers are have the same currency.

Now, I would like to make sure that for any ROLE, the salary in CAN region must be at least as much as that in the US - and the salary in EU must be at least as much as that in CAN.

How do I solve it so that I get the following data frame?

pd.DataFrame({
'REGION':['US','US','CAN','CAN', 'EU','EU','EU'],
'ROLE': ['mgr','dir','mgr','dir','mgr','dir','CEO'],
'SALARY' : [4,5,4,6,4.1,6,8],
'other_columns':['random_val1','random_val2','random_val3','random_val4','random_val5','random_val6','random_val7']

})

Please note that this is a sample data frame - in the real data frame, I have a few additional columns that I would like to keep unchanged. Thanks!

Upvotes: 1

Views: 60

Answers (3)

rchome
rchome

Reputation: 2723

Another solution using groupby and cummax. I like this method because you can extend the number of regions you need to support relatively easily by adding additional regions to the custom sorting order.

df = pd.DataFrame({
'REGION':['US','US','CAN','CAN', 'EU','EU','EU'],
'ROLE': ['mgr','dir','mgr','dir','mgr','dir','CEO'],
'SALARY' : [4,5,3.7,6,4.1,5.5,8],
'other_columns':['random_val1','random_val2','random_val3','random_val4','random_val5','random_val6','random_val7']})

# Replace the region with a categorical variable to ensure sorting order is US, CAN, EU
df["REGION"] = pd.Categorical(df["REGION"], ["US", "CAN", "EU"])
df = df.sort_values(["ROLE", "REGION"])
df = df.groupby("ROLE").apply(lambda x: x.assign(SALARY=x["SALARY"].cummax()))
# if you need your data in the original order again
df = df.sort_index()

Upvotes: 1

jezrael
jezrael

Reputation: 862611

Solution by mapping and slicing in MultiIndex, for set values i use Series.clip:

df = df.set_index(['REGION','ROLE'])
df1 = df.copy()

us = df.loc['US', 'SALARY']
can = df.loc['CAN', 'SALARY']
eu = df.loc['EU', 'SALARY']

df.loc['CAN', 'SALARY'] = can.clip(lower=can.index.map(us)).to_numpy()
df.loc['EU', 'SALARY'] = eu.clip(lower=eu.index.map(can)).to_numpy()
df = df.fillna(df1).reset_index()
print (df)
  REGION ROLE  SALARY other_columns
0     US  mgr     4.0   random_val1
1     US  dir     5.0   random_val2
2    CAN  mgr     4.0   random_val3
3    CAN  dir     6.0   random_val4
4     EU  mgr     4.1   random_val5
5     EU  dir     6.0   random_val6
6     EU  CEO     8.0   random_val7

Another solution woth pivoting and unpivot:

df1 = df.pivot('ROLE','REGION','SALARY')
df1['CAN'] = df1[['CAN','US']].max(axis=1)
df1['EU'] = df1[['CAN','EU']].max(axis=1)
   
df = df.join(df1.stack().rename('new'), on=['ROLE','REGION'])
df['SALARY'] = df.pop('new')
print (df)
  REGION ROLE  SALARY other_columns
0     US  mgr     4.0   random_val1
1     US  dir     5.0   random_val2
2    CAN  mgr     4.0   random_val3
3    CAN  dir     6.0   random_val4
4     EU  mgr     4.1   random_val5
5     EU  dir     6.0   random_val6
6     EU  CEO     8.0   random_val7

Upvotes: 0

tangkikodo
tangkikodo

Reputation: 1

import pandas as pd

data = pd.DataFrame({
'region':['US','US','CAN','CAN', 'EU','EU','EU'],
'role': ['mgr','dir','mgr','dir','mgr','dir','CEO'],
'salary' : [4,5,3.7,6,4.1,5.5,8],
'other_columns':['random_val1','random_val2','random_val3','random_val4','random_val5','random_val6','random_val7']})

pt = pd.pivot_table(data, values=['salary'], index=['role'], columns=['region'])

df = pt['salary'].fillna(0)

df['CAN'] = df.apply(lambda x: max(x['US'], x['CAN']), axis=1)
df['EU'] = df.apply(lambda x: max(x['CAN'], x['EU']), axis=1)

data['salary'] = data.apply(lambda x: df[x['region']][x['role']], axis=1)

print(data)

Upvotes: 0

Related Questions