Reputation: 125
I have two dataframes containing different information about the same patients. I need to use dataframe 1 to filter dataframe 2 so that df_2 keeps a patient's value in a row only if df_1 has an integer value for the same chromosome, strand, elementloc, and patient. If there is a NaN value in df_1, I'd like to put NaN in df_2 in that same location. NaN values already in df_2 should stay NaN.
So with df_1 and df_2 like:
import pandas as pd

df_1 = pd.DataFrame({'chromosome': [1, 1, 5, 4],
                     'strand': ['-', '-', '+', '-'],
                     'elementloc': [4991, 8870, 2703, 9674],
                     'Patient1_Reads': ['NaN', 25, 50, 'NaN'],
                     'Patient2_Reads': [35, 200, 'NaN', 500]})
print(df_1)
   chromosome strand  elementloc Patient1_Reads Patient2_Reads
0           1      -        4991            NaN             35
1           1      -        8870             25            200
2           5      +        2703             50            NaN
3           4      -        9674            NaN            500
df_2 = pd.DataFrame({'chromosome': [1, 1, 5, 4],
                     'strand': ['-', '-', '+', '-'],
                     'elementloc': [4991, 8870, 2703, 9674],
                     'Patient1_PSI': [0.76, 0.35, 0.04, 'NaN'],
                     'Patient2_PSI': [0.89, 0.15, 0.47, 0.32]})
print(df_2)
   chromosome strand  elementloc Patient1_PSI Patient2_PSI
0           1      -        4991         0.76         0.89
1           1      -        8870         0.35         0.15
2           5      +        2703         0.04         0.47
3           4      -        9674          NaN         0.32
I would like the new df_2 to look like:
   chromosome strand  elementloc Patient1_PSI Patient2_PSI
0           1      -        4991          NaN         0.89
1           1      -        8870         0.35         0.15
2           5      +        2703         0.04          NaN
3           4      -        9674          NaN         0.32
Upvotes: 2
Views: 47
Reputation: 71689
Use:
df3 = df1.merge(df2, on=['chromosome', 'strand', 'elementloc'])
r_cols = df3.columns[df3.columns.str.endswith('_Reads')]
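# swap only the trailing suffix (str.strip removes characters from both ends, not a suffix)
p_cols = r_cols.str.replace('_Reads$', '_PSI', regex=True)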
df3[p_cols] = df3[p_cols].mask(df3[r_cols].isna().to_numpy())
df3 = df3.drop(columns=r_cols)  # keyword form; the positional axis argument was removed in pandas 2.0
Details:
STEP A: Use DataFrame.merge to create a merged dataframe df3, obtained by merging the dataframes df1 and df2 on ['chromosome', 'strand', 'elementloc'].
# print(df3)
   chromosome strand  elementloc Patient1_Reads Patient2_Reads Patient1_PSI Patient2_PSI
0           1      -        4991            NaN           35.0         0.76         0.89
1           1      -        8870           25.0          200.0         0.35         0.15
2           5      +        2703           50.0            NaN         0.04         0.47
3           4      -        9674            NaN          500.0          NaN         0.32
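If a key combination could appear more than once in either frame, the merge would silently multiply rows. The following is a minimal sketch, assuming each (chromosome, strand, elementloc) triple is meant to be unique per frame, that uses the validate parameter of merge to check this. The two-row frames and the names keys, df1_dup and duplicate_rejected are made up for illustration.

```python
import numpy as np
import pandas as pd

keys = ['chromosome', 'strand', 'elementloc']

# Small illustrative frames (not the question's full data)
df1 = pd.DataFrame({'chromosome': [1, 1],
                    'strand': ['-', '-'],
                    'elementloc': [4991, 8870],
                    'Patient1_Reads': [np.nan, 25]})
df2 = pd.DataFrame({'chromosome': [1, 1],
                    'strand': ['-', '-'],
                    'elementloc': [4991, 8870],
                    'Patient1_PSI': [0.76, 0.35]})

# validate='one_to_one' raises MergeError if a key repeats in either frame
df3 = df1.merge(df2, on=keys, validate='one_to_one')

# Duplicate the first row of df1 to show the check firing
df1_dup = pd.concat([df1, df1.iloc[[0]]], ignore_index=True)
try:
    df1_dup.merge(df2, on=keys, validate='one_to_one')
    duplicate_rejected = False
except pd.errors.MergeError:
    duplicate_rejected = True
```

Whether a duplicate should raise or be de-duplicated first depends on the data; the check only makes the assumption explicit.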
STEP B: Use .str.endswith to get the columns in df3 that end with _Reads; we call these columns r_cols. Then use these _Reads columns to obtain the corresponding _PSI columns, which we call p_cols.
# print(r_cols)
Index(['Patient1_Reads', 'Patient2_Reads'], dtype='object')
# print(p_cols)
Index(['Patient1_PSI', 'Patient2_PSI'], dtype='object')
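One way to derive the _PSI names is a regex replacement anchored at the end of each name; str.strip, by contrast, removes any of the given characters from both ends rather than a suffix. The sketch below works on a standalone Index with the merged column names, and the names cols, missing and stripped are illustrative only. The sample_Reads column is a hypothetical name chosen to show the pitfall.

```python
import pandas as pd

cols = pd.Index(['chromosome', 'strand', 'elementloc',
                 'Patient1_Reads', 'Patient2_Reads',
                 'Patient1_PSI', 'Patient2_PSI'])

r_cols = cols[cols.str.endswith('_Reads')]

# Replace only the trailing '_Reads', so p_cols stays in the same order as r_cols
p_cols = r_cols.str.replace('_Reads$', '_PSI', regex=True)

# Every derived name should exist among the merged columns
missing = p_cols.difference(cols)

# Pitfall: strip('Reads') treats its argument as a set of characters,
# so a name that starts with one of them loses its leading letters too
stripped = pd.Index(['sample_Reads']).str.strip('Reads')[0]
```

Here missing is empty, while stripped no longer starts with "sample", which is why an anchored replacement is the safer choice for arbitrary patient labels.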
STEP C: Use DataFrame.isna on the _Reads columns to obtain a boolean mask, then pass this mask to DataFrame.mask to set the corresponding values in the _PSI columns to NaN. The mask is converted with .to_numpy() so that it is applied by position, since the _Reads and _PSI column names differ. Finally, use DataFrame.drop to drop the _Reads columns from the merged dataframe df3 to get the desired result:
# print(df3)
   chromosome strand  elementloc Patient1_PSI Patient2_PSI
0           1      -        4991          NaN         0.89
1           1      -        8870         0.35         0.15
2           5      +        2703         0.04          NaN
3           4      -        9674          NaN         0.32
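Note that isna only detects real missing values. The constructors in the question use the string 'NaN', which isna would not flag, so those columns would need converting first. The following end-to-end sketch assumes every non-key column is meant to be numeric; the helper to_numeric_frame is hypothetical, not part of pandas.

```python
import pandas as pd

keys = ['chromosome', 'strand', 'elementloc']

df1 = pd.DataFrame({'chromosome': [1, 1, 5, 4],
                    'strand': ['-', '-', '+', '-'],
                    'elementloc': [4991, 8870, 2703, 9674],
                    'Patient1_Reads': ['NaN', 25, 50, 'NaN'],
                    'Patient2_Reads': [35, 200, 'NaN', 500]})
df2 = pd.DataFrame({'chromosome': [1, 1, 5, 4],
                    'strand': ['-', '-', '+', '-'],
                    'elementloc': [4991, 8870, 2703, 9674],
                    'Patient1_PSI': [0.76, 0.35, 0.04, 'NaN'],
                    'Patient2_PSI': [0.89, 0.15, 0.47, 0.32]})


def to_numeric_frame(df):
    """Hypothetical helper: turn non-key columns into floats with real NaN."""
    out = df.copy()
    for col in out.columns.difference(keys):
        # Anything that cannot be parsed as a number becomes NaN
        out[col] = pd.to_numeric(out[col], errors='coerce')
    return out


df1 = to_numeric_frame(df1)
df2 = to_numeric_frame(df2)

# Same steps as in the answer: merge, pair up columns, mask, drop
df3 = df1.merge(df2, on=keys)
r_cols = df3.columns[df3.columns.str.endswith('_Reads')]
p_cols = r_cols.str.replace('_Reads$', '_PSI', regex=True)
df3[p_cols] = df3[p_cols].mask(df3[r_cols].isna().to_numpy())
df3 = df3.drop(columns=r_cols)
```

With the conversion in place, df3 should match the desired frame from the question: NaN where the reads were missing, and the existing NaN in Patient1_PSI left untouched.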
Upvotes: 2