Arun
Arun

Reputation: 2478

Comparing multiple pandas DataFrames and columns based on values

I have a problem where I have a bunch of people whose summary has been aggregated to form a table/pd DataFrame. Scores need to be computed for such people depending on their profession and group. Each profession can have different number of groups and is dynamic.

There are 3 pd DFs-

# DF containing summary aggregation for different people-
summary_data = pd.read_csv("Sample_Summary.csv")

# DF containing threshold according to which scores need to be computed-
threshold_data = pd.read_csv("Sample_Threshold.csv")

# DF containing weightage for different columns-
weightage_data = pd.read_csv("Sample_Weightage.csv")

summary_data.shape, threshold_data.shape
# ((12, 6), (8, 6))

weightage_data.shape
# (8, 6)

The flow for the score computation is as follows: for each person in 'summary_data', do: iterate through all columns in 'summary_data' and check the corresponding columns in 'threshold_data' whether the value is greater than or equal to the value in 'threshold_data'. If yes, add the score for that violated column in 'weightage_data'.

summary_data.loc[:, :]
'''
   Profession  Group  Salary  Number_Houses  Number_Cars  Number_Cellphones
0      Doctor      1  300000              2            1                  3
1      Doctor      2  150000              3            2                  1
2      Doctor      3  260000              2            2                  2
3      Doctor      1  120000              1            3                  3
4    Engineer      1   85000              1            1                  1
5    Engineer      2  125000              2            3                  4
6    Engineer      2   60000              3            0                  1
7    Engineer      1   90000              3            3                  3
8    Business      1  266000              2            3                  5
9    Business      2  301000              3            5                  8
10   Business      3  385000              4            1                  1
11   Business      1  183000              2            3                  4
'''

threshold_data.loc[:, :]
'''
  Profession  Group  Salary  Number_Houses  Number_Cars  Number_Cellphones
0     Doctor      1  180000              2            3                  3
1     Doctor      2  120000              3            3                  2
2     Doctor      3  200000              2            3                  3
3   Engineer      1   90000              2            3                  4
4   Engineer      2  120000              3            2                  2
5   Business      1  200000              2            4                  4
6   Business      2  300000              3            6                  6
7   Business      3  400000              5            1                  1
'''

weightage_data.loc[:, :]
'''
  Profession  Group  Salary  Number_Houses  Number_Cars  Number_Cellphones
0     Doctor      1       4              6            5                  2
1     Doctor      2       4              5            5                  2
2     Doctor      3       4              5            5                  1
3   Engineer      1       3              5            3                  1
4   Engineer      2       4              5            3                  2
5   Business      1       5              5            3                  1
6   Business      2       5              5            4                  2
7   Business      3       5              6            4                  2
'''

For example, the two people belonging to 'Doctor' profession and Group = 1,

summary_data[(summary_data['Profession'] == 'Doctor') & (summary_data['Group'] == 1)]
'''
  Profession  Group  Salary  Number_Houses  Number_Cars  Number_Cellphones
0     Doctor      1  300000              2            1                  3
3     Doctor      1  120000              1            3                  3
'''

The corresponding 'threshold_data' & 'weightage_data' are:

threshold_data[(threshold_data['Profession'] == 'Doctor') & (threshold_data['Group'] == 1)]
'''
  Profession  Group  Salary  Number_Houses  Number_Cars  Number_Cellphones
0     Doctor      1  180000              2            3                  3
'''

weightage_data[(weightage_data['Profession'] == 'Doctor') & (weightage_data['Group'] == 1)]
'''
  Profession  Group  Salary  Number_Houses  Number_Cars  Number_Cellphones
0     Doctor      1       4              6            5                  2
'''

The first Doctor has Salary=300000 which is greater than the threshold value=180000, so it crosses the threshold. So, the corresponding weight of 4 (from 'weightage_data') is added, Number_of_houses=2 is greater than or equal to threshold value=2. so the weight of 6 is added to 4. Number_of_Cars=1 is less than threshold value=3, so its not considered. And finally, Number_of_Cellphones=3 is greater than or equal to threshold, so the weight=2 is added. The final score = 4+6+2 = 12.

Similarly, for second Doctor, the score = 5 ('Number_of_Cars' col is violated) + 2 ('Number_of_Cellphones' col is violated)= 7

How can I write code for this without using inefficient and non-pythonic for loops?

Upvotes: 0

Views: 241

Answers (1)

mnist
mnist

Reputation: 6956

This should do the job without any loop:

# define grouping dolumns
grp_vars = ['Profession', 'Group']

# same structure to all data frames by merge
threshold_data_big = pd.merge(summary_data[grp_vars], threshold_data, how='left', on=grp_vars)
weightage_data_big = pd.merge(summary_data[grp_vars], weightage_data, how='left', on=grp_vars)

# compare threshold to acutal data
above_threshold = threshold_data_big.drop(grp_vars, axis=1) <= summary_data.drop(grp_vars, axis=1)

# select those weight values that are above the threshold and calcualte rowsum
summary_data['final_score'] = weightage_data_big.drop(grp_vars, axis=1)[above_threshold].sum(axis=1)
summary_data

enter image description here

Upvotes: 1

Related Questions