Reputation: 474
I am a fairly new Python user and I am stuck on a problem. Any guidance would be greatly appreciated.
I have a pandas data frame with three columns 'ID', 'Intervention', and 'GradeLevel'. See code below:
import pandas as pd
data = [[100,'Long', 0], [101,'Short', 1],[102,'Medium', 2],[103,'Long', 0],[104,'Short', 1],[105,'Medium', 2]]
intervention_df = pd.DataFrame(data, columns = ['ID', 'Intervention', 'GradeLevel'])
I then created a dictionary of data frames grouped by 'Intervention'. See code below:
intervention_dict = {Intervention: dfi for Intervention, dfi in intervention_df.groupby('Intervention')}
My question is: can you loop through the values of the dictionary and manipulate each one? Specifically, I am trying to reference a lookup table, which can be thought of as a roster. My goal is to label everyone in the roster as either 'Yes - Intervention Name' or 'No - Intervention Name'.
It becomes tricky because, let's say, the Long intervention has only GradeLevel 0. That means I would want to tag anyone with grade level 0 who is in intervention_df as 'Yes - Long' and anyone with grade level 0 who is not in intervention_df as 'No - Long'. This would become a new column called 'Value'. I would also need to create another column, 'Category', which would specify the intervention name; in this example it would simply be 'Long'.
lookup_data = [[100, 0], [101, 1],[102, 2],[103, 0],[104, 1],[105, 2], [106, 0], [107, 0],[108, 2],[109, 1]]
lookup_df = pd.DataFrame(lookup_data, columns = ['ID', 'GradeLevel'])
For example, the 'Long' entry of the dictionary would look like this after processing:
longint_data = [[100,'Long', 'Yes - Long'],[103,'Long', 'Yes - Long'], [106,'Long', 'No - Long'], [107,'Long', 'No - Long']]
longint_df = pd.DataFrame(longint_data, columns = ['ID','Category', 'Value'])
The desired final output after all manipulation would look like this:
result_data = [[100,'Long', 'Yes - Long'] , [101,'Short','Yes - Short'], [102,'Medium','Yes - Medium'], [103,'Long', 'Yes - Long'], [104,'Short','Yes - Short'] , [105, 'Medium','Yes - Medium'], [106,'Long', 'No - Long'], [107,'Long', 'No - Long'], [108,'Medium','No - Medium'], [109,'Short','No - Short']]
result_df = pd.DataFrame(result_data, columns = ['ID','Category', 'Value'])
Thank you!
Upvotes: 1
Views: 83
Reputation: 25269
Here is a solution without using the dictionary intervention_dict. Below is your data, which I get from your commands:
In [1048]: intervention_df
Out[1048]:
    ID Intervention  GradeLevel
0  100         Long           0
1  101        Short           1
2  102       Medium           2
3  103         Long           0
4  104        Short           1
5  105       Medium           2
In [1049]: lookup_df
Out[1049]:
    ID  GradeLevel
0  100           0
1  101           1
2  102           2
3  103           0
4  104           1
5  105           2
6  106           0
7  107           0
8  108           2
9  109           1
Step 1: Do an outer merge between lookup_df and intervention_df, create the column Value, and set the index to GradeLevel with set_index.
In [1059]: df = lookup_df.merge(intervention_df, on=['ID', 'GradeLevel'], how='outer').assign(Value='Yes - '+intervention_df['Intervention']).set_index('GradeLevel')
In [1060]: df
Out[1060]:
             ID Intervention         Value
GradeLevel
0           100         Long    Yes - Long
1           101        Short   Yes - Short
2           102       Medium  Yes - Medium
0           103         Long    Yes - Long
1           104        Short   Yes - Short
2           105       Medium  Yes - Medium
0           106          NaN           NaN
0           107          NaN           NaN
2           108          NaN           NaN
1           109          NaN           NaN
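One thing to watch in Step 1: assign lines up 'Yes - ' + intervention_df['Intervention'] with the merged frame by row label, so it only works because the first six merged rows carry the same labels as intervention_df. A minimal sketch that makes the match explicit instead, assuming a left merge from the roster is acceptable. The names merged and in_intervention are illustrative and not part of the answer above:

```python
import pandas as pd

intervention_df = pd.DataFrame(
    [[100, 'Long', 0], [101, 'Short', 1], [102, 'Medium', 2],
     [103, 'Long', 0], [104, 'Short', 1], [105, 'Medium', 2]],
    columns=['ID', 'Intervention', 'GradeLevel'])
lookup_df = pd.DataFrame(
    [[100, 0], [101, 1], [102, 2], [103, 0], [104, 1],
     [105, 2], [106, 0], [107, 0], [108, 2], [109, 1]],
    columns=['ID', 'GradeLevel'])

# indicator=True adds a _merge column recording where each row came from:
# 'both' for roster rows found in intervention_df, 'left_only' otherwise.
merged = lookup_df.merge(intervention_df, on=['ID', 'GradeLevel'],
                         how='left', indicator=True)

# Boolean mask: True where the roster member is in an intervention.
in_intervention = merged['_merge'] == 'both'
print(merged[['ID', 'Intervention', '_merge']])
```

The mask does not depend on how the two frames happen to be ordered, so it should still hold if intervention_df is shuffled or re-indexed.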
Step 2: Create df_fillna to fill the NaN values in df.
In [1063]: df_fillna = intervention_df.groupby('Intervention').head(1).assign(Value='No - '+intervention_df['Intervention']).set_index('GradeLevel')
In [1064]: df_fillna
Out[1064]:
             ID Intervention        Value
GradeLevel
0           100         Long    No - Long
1           101        Short   No - Short
2           102       Medium  No - Medium
Step 3 (final): Use combine_first to fill the NaN values in df from df_fillna, then sort_values on ID and reset_index to drop GradeLevel.
In [1068]: df.combine_first(df_fillna).sort_values('ID').reset_index(drop=True)
Out[1068]:
    ID Intervention         Value
0  100         Long    Yes - Long
1  101        Short   Yes - Short
2  102       Medium  Yes - Medium
3  103         Long    Yes - Long
4  104        Short   Yes - Short
5  105       Medium  Yes - Medium
6  106         Long     No - Long
7  107         Long     No - Long
8  108       Medium   No - Medium
9  109        Short    No - Short
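Since the question asked specifically about looping over the dictionary, here is a rough sketch of that route as well. It assumes each grade level belongs to exactly one intervention (true for the sample data), so every roster member lands in exactly one group. The names pieces, roster, and enrolled are illustrative only:

```python
import pandas as pd

data = [[100, 'Long', 0], [101, 'Short', 1], [102, 'Medium', 2],
        [103, 'Long', 0], [104, 'Short', 1], [105, 'Medium', 2]]
intervention_df = pd.DataFrame(data, columns=['ID', 'Intervention', 'GradeLevel'])
lookup_data = [[100, 0], [101, 1], [102, 2], [103, 0], [104, 1],
               [105, 2], [106, 0], [107, 0], [108, 2], [109, 1]]
lookup_df = pd.DataFrame(lookup_data, columns=['ID', 'GradeLevel'])

intervention_dict = {name: dfi for name, dfi in intervention_df.groupby('Intervention')}

pieces = []
for name, dfi in intervention_dict.items():
    # Roster members whose grade level occurs in this intervention.
    roster = lookup_df[lookup_df['GradeLevel'].isin(dfi['GradeLevel'])]
    # True for roster members who actually appear in this intervention.
    enrolled = roster['ID'].isin(dfi['ID'])
    pieces.append(pd.DataFrame({
        'ID': roster['ID'],
        'Category': name,
        'Value': enrolled.map({True: f'Yes - {name}', False: f'No - {name}'}),
    }))

# Stack the per-intervention frames back into one table ordered by ID.
result_df = pd.concat(pieces).sort_values('ID').reset_index(drop=True)
print(result_df)
```

If a grade level could belong to more than one intervention, a roster member would show up once per matching intervention, which may or may not be what you want.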
Upvotes: 1
Reputation: 3722
This is what I feel like you're going for, but without a clearer explanation I'm not sure.
import pandas as pd
data = [[100,'Long', 0], [101,'Short', 1],[102,'Medium', 2],[103,'Long', 0],[104,'Short', 1],[105,'Medium', 2]]
intervention_df = pd.DataFrame(data, columns = ['ID', 'Intervention', 'GradeLevel'])
lookup_data = [[100, 0], [101, 1],[102, 2],[103, 0],[104, 1],[105, 2], [106, 0], [107, 0],[108, 2],[109, 1]]
lookup_df = pd.DataFrame(lookup_data, columns = ['ID', 'GradeLevel'])
# outer merge keeps every roster row; rows missing from intervention_df get NaN in y
df = pd.merge(intervention_df.assign(y='Yes'), lookup_df, on=['ID', 'GradeLevel'], how='outer')
# anyone without a 'Yes' flag was not in intervention_df
df.loc[df.y.isnull(), 'y'] = 'No'
    ID Intervention  GradeLevel    y
0  100         Long           0  Yes
1  101        Short           1  Yes
2  102       Medium           2  Yes
3  103         Long           0  Yes
4  104        Short           1  Yes
5  105       Medium           2  Yes
6  106          NaN           0   No
7  107          NaN           0   No
8  108          NaN           2   No
9  109          NaN           1   No
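To get from this Yes/No flag to the 'Category' and 'Value' columns in the question, one possible continuation is to look up the intervention name from the grade level. This is only a sketch and assumes each grade level maps to exactly one intervention, as in the sample data. The names grade_to_name and result are made up for the example:

```python
import pandas as pd

data = [[100, 'Long', 0], [101, 'Short', 1], [102, 'Medium', 2],
        [103, 'Long', 0], [104, 'Short', 1], [105, 'Medium', 2]]
intervention_df = pd.DataFrame(data, columns=['ID', 'Intervention', 'GradeLevel'])
lookup_data = [[100, 0], [101, 1], [102, 2], [103, 0], [104, 1],
               [105, 2], [106, 0], [107, 0], [108, 2], [109, 1]]
lookup_df = pd.DataFrame(lookup_data, columns=['ID', 'GradeLevel'])

df = pd.merge(intervention_df.assign(y='Yes'), lookup_df,
              on=['ID', 'GradeLevel'], how='outer')
df['y'] = df['y'].fillna('No')

# Assumption: one intervention per grade level, so the first name seen
# for each GradeLevel is the only one.
grade_to_name = (intervention_df.drop_duplicates('GradeLevel')
                 .set_index('GradeLevel')['Intervention'])

# Fill in the intervention name for every roster row, then build the label.
df['Category'] = df['GradeLevel'].map(grade_to_name)
df['Value'] = df['y'] + ' - ' + df['Category']

result = df[['ID', 'Category', 'Value']].sort_values('ID').reset_index(drop=True)
print(result)
```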
Upvotes: 2