Reputation: 2478
I have a problem where I have a bunch of people whose summary has been aggregated to form a table/pd DataFrame. Scores need to be computed for such people depending on their profession and group. Each profession can have different number of groups and is dynamic.
There are 3 pd DFs-
# DF containing summary aggregation for different people-
summary_data = pd.read_csv("Sample_Summary.csv")
# DF containing threshold according to which scores need to be computed-
threshold_data = pd.read_csv("Sample_Threshold.csv")
# DF containing weightage for different columns-
weightage_data = pd.read_csv("Sample_Weightage.csv")
summary_data.shape, threshold_data.shape
# ((12, 6), (8, 6))
weightage_data.shape
# (8, 6)
The flow for the score computation is as follows: for each person in 'summary_data', do: iterate through all columns in 'summary_data' and check the corresponding columns in 'threshold_data' whether the value is greater than or equal to the value in 'threshold_data'. If yes, add the score for that violated column in 'weightage_data'.
summary_data.loc[:, :]
'''
Profession Group Salary Number_Houses Number_Cars Number_Cellphones
0 Doctor 1 300000 2 1 3
1 Doctor 2 150000 3 2 1
2 Doctor 3 260000 2 2 2
3 Doctor 1 120000 1 3 3
4 Engineer 1 85000 1 1 1
5 Engineer 2 125000 2 3 4
6 Engineer 2 60000 3 0 1
7 Engineer 1 90000 3 3 3
8 Business 1 266000 2 3 5
9 Business 2 301000 3 5 8
10 Business 3 385000 4 1 1
11 Business 1 183000 2 3 4
'''
threshold_data.loc[:, :]
'''
Profession Group Salary Number_Houses Number_Cars Number_Cellphones
0 Doctor 1 180000 2 3 3
1 Doctor 2 120000 3 3 2
2 Doctor 3 200000 2 3 3
3 Engineer 1 90000 2 3 4
4 Engineer 2 120000 3 2 2
5 Business 1 200000 2 4 4
6 Business 2 300000 3 6 6
7 Business 3 400000 5 1 1
'''
weightage_data.loc[:, :]
'''
Profession Group Salary Number_Houses Number_Cars Number_Cellphones
0 Doctor 1 4 6 5 2
1 Doctor 2 4 5 5 2
2 Doctor 3 4 5 5 1
3 Engineer 1 3 5 3 1
4 Engineer 2 4 5 3 2
5 Business 1 5 5 3 1
6 Business 2 5 5 4 2
7 Business 3 5 6 4 2
'''
For example, the two people belonging to 'Doctor' profession and Group = 1,
summary_data[(summary_data['Profession'] == 'Doctor') & (summary_data['Group'] == 1)]
'''
Profession Group Salary Number_Houses Number_Cars Number_Cellphones
0 Doctor 1 300000 2 1 3
3 Doctor 1 120000 1 3 3
'''
The corresponding 'threshold_data' & 'weightage_data' are:
threshold_data[(threshold_data['Profession'] == 'Doctor') & (threshold_data['Group'] == 1)]
'''
Profession Group Salary Number_Houses Number_Cars Number_Cellphones
0 Doctor 1 180000 2 3 3
'''
weightage_data[(weightage_data['Profession'] == 'Doctor') & (weightage_data['Group'] == 1)]
'''
Profession Group Salary Number_Houses Number_Cars Number_Cellphones
0 Doctor 1 4 6 5 2
'''
The first Doctor has Salary=300000 which is greater than the threshold value=180000, so it crosses the threshold. So, the corresponding weight of 4 (from 'weightage_data') is added, Number_of_houses=2 is greater than or equal to threshold value=2. so the weight of 6 is added to 4. Number_of_Cars=1 is less than threshold value=3, so its not considered. And finally, Number_of_Cellphones=3 is greater than or equal to threshold, so the weight=2 is added. The final score = 4+6+2 = 12.
Similarly, for second Doctor, the score = 5 ('Number_of_Cars' col is violated) + 2 ('Number_of_Cellphones' col is violated)= 7
How can I write code for this without using inefficient and non-pythonic for loops?
Upvotes: 0
Views: 241
Reputation: 6956
This should do the job without any loop:
# define grouping dolumns
grp_vars = ['Profession', 'Group']
# same structure to all data frames by merge
threshold_data_big = pd.merge(summary_data[grp_vars], threshold_data, how='left', on=grp_vars)
weightage_data_big = pd.merge(summary_data[grp_vars], weightage_data, how='left', on=grp_vars)
# compare threshold to acutal data
above_threshold = threshold_data_big.drop(grp_vars, axis=1) <= summary_data.drop(grp_vars, axis=1)
# select those weight values that are above the threshold and calcualte rowsum
summary_data['final_score'] = weightage_data_big.drop(grp_vars, axis=1)[above_threshold].sum(axis=1)
summary_data
Upvotes: 1