Debayan
Debayan

Reputation: 854

reshape pandas dataframe alongwith some derived columns

I have a dataframe :

Drug_Name   CurrentYear PMPM    Cost
Drug A      201901        25    10
Drug B      201902        25    20
Drug C      201903        50    30
Drug D      202001        75    25
Drug E      202002        100   100

And i want to convert this into :

Drug_Name   Current Year    ComparisionYear Measure_Name    Measure_Value_Current   Measure_Value_Comparision
Drug A  201901  201901  PMPM    25  25
Drug A  201901  201901  Cost    10  10
Drug B  201902  201901  PMPM    25  25
Drug B  201902  201901  Cost    20  10
Drug C  201903  201901  PMPM    50  25
Drug C  201903  201901  Cost    30  10
Drug C  201903  201902  PMPM    50  25
Drug C  201903  201902  Cost    30  20
Drug D  202001  201901  PMPM    75  25
Drug D  202001  201901  Cost    25  10
Drug D  202001  201902  PMPM    75  25
Drug D  202001  201902  Cost    25  20
Drug D  202001  201903  PMPM    75  50
Drug D  202001  201903  Cost    25  30
Drug E  202002  201901  PMPM    100 25
Drug E  202002  201901  Cost    100 10
Drug E  202002  201902  PMPM    100 25
Drug E  202002  201902  Cost    100 20
Drug E  202002  201903  PMPM    100 50
Drug E  202002  201903  Cost    100 30
Drug E  202002  202001  PMPM    100 75
Drug E  202002  202001  Cost    100 25

The idea is not just to pivot but also to add these derived columns for every possible combinations

Upvotes: 0

Views: 25

Answers (1)

Ben Pap
Ben Pap

Reputation: 2579

So we are going to do a 3 step process.

  1. Get all the combo of drugs with all information
  2. Filter it down to what we are looking for
  3. Clean the df (ie col names, reset the index ect)

-- Get all the combo of drugs with all information --

We are going to get all the combo of drugs then save it in a df called allCombo

import itertools
import pandas as pd


allCombo = pd.DataFrame(list(itertools.product(df['Drug_Name'], repeat = 2)), columns = ['Drug1','Drug2'])

Next we are going to make the dtype of the orignial df's drugs a catagory, and ensure that they are ordered. This will allow us to have a significantly easier time filtering when we have to.

We are also going to melt your original df to put PMPM and Cost on rows as you have it in your final df.

df['Drug_Name'] = df['Drug_Name'].astype(pd.api.types.CategoricalDtype(df['Drug_Name'].to_list(), ordered = True))
df= df.melt(['Drug_Name','CurrentYear'], var_name = 'Measure_Name', value_name = 'Measure_Value')

Next we are going to merge our melted original df two times with our allCombo df. After this we have all the information of both drugs we want to compare on each row of the df.

merged = allCombo.merge(df, left_on = 'Drug1', right_on = 'Drug_Name').merge(df, left_on ='Drug2', right_on='Drug_Name')

-- Filter it down to what we are looking for --

Next, we will take our merged df and filter it to only display the rows that we want. You will see here how we used the ordered catagory to help us out here.

filtered = merged[(merged['Drug_Name_x'] < merged['Drug_Name_y'])& (merged['Measure_Name_x']==merged['Measure_Name_y']) |
                 ((merged['Drug_Name_x'] == 'Drug A') &(merged['Drug_Name_y'] == 'Drug A') & (merged['Measure_Name_x']==merged['Measure_Name_y']))].copy()

After the or statement in this filter is specifically due to the fact you want only Drug A compared with Drug A, but not any of the other drugs compared to themselves. We now have the df you are looking for.

-- Clean the df (ie col names, reset the index ect) --

Next is simple dropping of columns, and renaming of columns to match what you wanted.

filtered.drop(columns = ['Drug_Name_x', 'Measure_Name_x'], inplace = True)
filtered.rename(columns = {'Drug_Name_y':'Drug_Name', 
                           'CurrentYear_y':'Current Year',
                           'CurrentYear_x':'ComparisionYear', 
                           'Measure_Name_y':'Measure_Name',
                           'Measure_Value_y':'Measure_Value_Current',
                           'Measure_Value_x':'Measure_Value_Comparision'
                          }, inplace = True)
final = filtered[['Drug_Name', 'Current Year', 'ComparisionYear', 'Measure_Name', 'Measure_Value_Current', 'Measure_Value_Comparision']].reset_index(drop = True)

And now you have your final df!

Upvotes: 1

Related Questions