Reputation: 4723
I have a drug reference as a df called df_drug_ref
(below). There are three drugs (A, B, and C). The corresponding ATC is listed in the second column. However, if patient has a DIN within the Drug_BIN_Id_Exclusion
list, then s/he would not be considered as using that drug (ie. 011235 for Drug A).
Drug Drug_ATC_Id Drug_DIN_Id_Exclusion
A N123 [011235]
B B5234 [65413, 654351]
C N32456 []
The following is the other df called df_row. This captures all the drugs dispensed by each individual. And each individual has his own People_Id
.
People_Id Drug_ATC Drug_DIN A B C
1001 N123
1001 N123 011235
1001 N32456 011232
1001 N111
1002 B5234 65413
1002 B5234 654090
1002 N123 011235
I would like to assign '1' for the corresponding drug (looping iteratively to check for A, B, or C and assigning to the corresponding columns) if, in that row, the ATC code matches with the drug reference and the DIN is not contained within the exclusion list. The result should be:
People_Id Drug_ATC Drug_DIN A B C
1001 N123 1 0 0
1001 N123 011235 0 0 0
1001 N32456 011232 0 0 1
1001 N111 0 0 0
1002 B5234 65413 0 0 0
1002 B5234 654090 0 1 0
1002 N123 011235 0 0 0
I understand how to use apply
function within the same df itself, but I don't know how to also use an external df as reference.
Upvotes: 1
Views: 49
Reputation: 4723
This is a working solution using function and iterrows:
def check_rx_condition(row):
for index, col in df_drug_ref.iterrows():
if ((col['Drug_ATC_Id'] in row['Drug_ATC'])&
(row['DRUG_DIN'] not in col['Drug_DIN_Id_Exclusion'])):
row[col['Drug']] = 1
else:
row[col['Drug']] = 0
return row
df_row = df_row.apply(check_rx_condition, axis=1)
Upvotes: 1
Reputation: 29635
First you can split your lists to several columns with apply(pd.Series)
and join
them to df_drug_ref
:
print (df_drug_ref.join(df_drug_ref['Drug_DIN_Id_Exclusion'].apply(pd.Series)))
Drug Drug_ATC_Id Drug_DIN_Id_Exclusion 0 1
0 A N123 [011235] 011235 NaN
1 B B5234 [65413, 654351] 65413 654351
2 C N32456 [] NaN NaN
Then you can merge
on the column 'Drug_ATC' the above joined dataframe to People_Id
, after doing some cleaning on columns:
df_merge = People_Id.merge(df_drug_ref[['Drug', 'Drug_ATC_Id']]
.join(df_drug_ref['Drug_DIN_Id_Exclusion']
.apply(pd.Series)
.add_prefix('Drug_DIN_'))
.rename(columns={'Drug_ATC_Id':'Drug_ATC'}),
how='left')
to get df_merge
:
People_Id Drug_ATC Drug_DIN Drug Drug_DIN_0 Drug_DIN_1
0 1001 N123 A 011235 NaN
1 1001 N123 011235 A 011235 NaN
2 1001 N32456 011235 C NaN NaN
3 1001 N111 NaN NaN NaN
4 1002 B5234 65413 B 65413 654351
5 1002 B5234 654090 B 65413 654351
6 1002 N123 011235 A 011235 NaN
Now you can replace the column 'Drug' with NaN where the value in 'Drug_DIN' is in one of the columns 'Drug_DIN_i' with np.any
:
mask = np.any(df_merge.filter(like='Drug_DIN').iloc[:,:1].values ==
df_merge.filter(like='Drug_DIN').iloc[:,1:].values, axis=1)
df_merge.loc[mask,'Drug'] = np.nan
Finally, to create the columns A, B, C ... you can use pd.get_dummies
with set_index
and then reset_index
:
new_People_Id = pd.get_dummies(df_merge.set_index(['People_Id','Drug_ATC','Drug_DIN'])['Drug']).reset_index()
print (new_People_Id)
People_Id Drug_ATC Drug_DIN A B C
0 1001 N123 1 0 0
1 1001 N123 011235 0 0 0
2 1001 N32456 011235 0 0 1
3 1001 N111 0 0 0
4 1002 B5234 65413 0 0 0
5 1002 B5234 654090 0 1 0
6 1002 N123 011235 0 0 0
Note here you can also use join
such as:
new_People_Id = df_merge[['People_Id','Drug_ATC','Drug_DIN']].join(df_merge['Drug'].str.get_dummies())
maybe faster.
Upvotes: 1