Reputation: 75
I've got a PDB DataFrame with residue insertion codes. Simplified example:
import pandas as pd

d = {'ATOM': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
     'residue_number': [2, 2, 2, 3, 3, 3, 3, 3, 3, 5, 5, 5],
     'insertion': ['', '', '', '', '', '', 'A', 'A', 'A', '', '', '']}
df = pd.DataFrame(data=d)
Dataframe:
    ATOM  residue_number insertion
0      1               2
1      2               2
2      3               2
3      4               3
4      5               3
5      6               3
6      7               3         A
7      8               3         A
8      9               3         A
9     10               5
10    11               5
11    12               5
I need to renumber the residues according to a different numbering and insertion scheme. Output from the renumbering script can be formatted into a dictionary of tuples, e.g.
my_dict = {(2,): 1, (3,): 2, (3, 'A') : 3, (5, ) : (4, 'A') }
Is it possible to map this dictionary of tuples onto the two columns ['residue_number', 'insertion']? The desired output would be:
    ATOM  residue_number insertion
0      1               1
1      2               1
2      3               1
3      4               2
4      5               2
5      6               2
6      7               3
7      8               3
8      9               3
9     10               4         A
10    11               4         A
11    12               4         A
I've been searching and banging my head on this for a few days. I've tried mapping and MultiIndex but can't seem to find a way to map a dictionary of tuples across multiple columns. I feel like I'm thinking about it wrong somehow. Thanks for any advice!
Upvotes: 4
Views: 2173
Reputation: 30930
I think we can create a DataFrame from your dictionary after modifying it so that all values are tuples. Then we can use DataFrame.join. I think this is easier (and recommended) if we convert the blank values of the insertion column to NaN.
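import numpy as np
import pandas as pd

new_df = (
    # Turn blank insertion codes into NaN so they line up with the NaN-padded keys
    df.assign(insertion=df['insertion'].replace(r'^\s*$', np.nan, regex=True)
                                       .mask(df['insertion'].isnull()))
      # Build the mapping frame from the dict (scalar values padded to tuples) and join on both columns
      .join(pd.DataFrame({x: (y if isinstance(y, tuple) else (y, np.nan))
                          for x, y in my_dict.items()},
                         index=['new_residue_number', 'new_insertion']).T,
            on=['residue_number', 'insertion'])
      .fillna('')
      # Swap the old columns for the new ones
      .drop(['residue_number', 'insertion'], axis=1)
      .rename(columns={'new_residue_number': 'residue_number',
                       'new_insertion': 'insertion'})
)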
print(new_df)
    ATOM  residue_number insertion
0      1             1.0
1      2             1.0
2      3             1.0
3      4             2.0
4      5             2.0
5      6             2.0
6      7             3.0
7      8             3.0
8      9             3.0
9     10             4.0         A
10    11             4.0         A
11    12             4.0         A
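The residue numbers print as 1.0, 2.0, ... here rather than as integers. A minimal sketch of casting them back, assuming every row found a match (so no blank strings were left in the column by fillna). The two-row frame below is only a stand-in for new_df:

```python
import pandas as pd

# Stand-in for new_df: residue numbers came back as floats after the join
new_df = pd.DataFrame({'ATOM': [1, 10],
                       'residue_number': [1.0, 4.0],
                       'insertion': ['', 'A']})

# Cast back to integers; this raises if any value is not a whole number
new_df['residue_number'] = new_df['residue_number'].astype(int)
print(new_df)
```

If some rows might be unmatched, it is probably safer to do this cast before the fillna('') step, or to check for missing values first.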
Detail
print(pd.DataFrame({x:(y if isinstance(y,tuple) else (y,np.nan))
for x,y in my_dict.items()},
index = ['new_residue_number','new_insertion']).T)
       new_residue_number new_insertion
2 NaN                   1           NaN
3 NaN                   2           NaN
  A                     3           NaN
5 NaN                   4             A
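A variation on the same join, sketched under the assumption that you would rather keep the blanks as '' than convert them to NaN. Every key and value is first padded to a (number, insertion) pair, and the mapping frame gets a two-level index to join on. The helper name as_pair is hypothetical, not something from pandas:

```python
import pandas as pd

df = pd.DataFrame({
    'ATOM': list(range(1, 13)),
    'residue_number': [2, 2, 2, 3, 3, 3, 3, 3, 3, 5, 5, 5],
    'insertion': ['', '', '', '', '', '', 'A', 'A', 'A', '', '', ''],
})
my_dict = {(2,): 1, (3,): 2, (3, 'A'): 3, (5,): (4, 'A')}

def as_pair(x):
    """Hypothetical helper: pad a scalar or 1-tuple to (number, insertion)."""
    t = x if isinstance(x, tuple) else (x,)
    return t if len(t) == 2 else (t[0], '')

keys = [as_pair(k) for k in my_dict]
vals = [as_pair(v) for v in my_dict.values()]

# Mapping frame indexed by the old (residue_number, insertion) pair
mapping = pd.DataFrame(
    vals,
    index=pd.MultiIndex.from_tuples(keys, names=['residue_number', 'insertion']),
    columns=['new_residue_number', 'new_insertion'],
)

# Join the two key columns of df against the two index levels of mapping
joined = df.join(mapping, on=['residue_number', 'insertion'])
print(joined)
```

Because nothing is converted to NaN, there is no fillna step, and a row that fails to match shows up as NaN in the new columns instead of being silently blanked.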
Upvotes: 1
Reputation: 171
In this case I think you need to define a function that takes your old residue_number and insertion as input and returns the new ones. I will work directly from the df, so to avoid extra coding I will redefine the keys of your my_dict from (2,) to (2,''). Here is the code:
import pandas as pd
d = {'ATOM' : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
'residue_number' : [2, 2, 2, 3, 3, 3, 3, 3, 3, 5, 5, 5],
'insertion' : ['', '', '', '', '', '', 'A', 'A', 'A', '', '', '']}
df = pd.DataFrame(data = d)
# Our new dict with keys and values as tuples
my_new_dict = {(2,''): (1,''), (3,''): (2,''), (3,'A'): (3,''), (5,''): (4,'A') }
# We need a function that maps a tuple (residue_number, insertion) into your new_residue_number and new_insertion values
def new_residue_number(residue_number, insertion, my_new_dict):
    # keys are tuples
    key = (residue_number, insertion)
    # Return new residue_number and insertion values
    return my_new_dict[key]
# Example to see how this works
print(new_residue_number(2, '', my_new_dict)) # Output (1,'')
print(new_residue_number(5, '', my_new_dict)) # Output (4, 'A')
print(new_residue_number(3, 'A', my_new_dict)) # Output (3,'')
# Now we apply this to our df and save it in the same df in two new columns
df[['new_residue_number','new_insertion']] = df.apply(lambda row: pd.Series(new_residue_number(row['residue_number'], row['insertion'], my_new_dict)), axis=1)
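On a full PDB file the row-wise apply can get slow, since it builds one pd.Series per atom. A minimal sketch of the same lookup without apply, reusing my_new_dict as defined above. The names keys, new_values and new_cols are just for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'ATOM': list(range(1, 13)),
    'residue_number': [2, 2, 2, 3, 3, 3, 3, 3, 3, 5, 5, 5],
    'insertion': ['', '', '', '', '', '', 'A', 'A', 'A', '', '', ''],
})
my_new_dict = {(2, ''): (1, ''), (3, ''): (2, ''), (3, 'A'): (3, ''), (5, ''): (4, 'A')}

# Pair up the two key columns row by row, then look each pair up in the dict
keys = zip(df['residue_number'], df['insertion'])
new_values = [my_new_dict[key] for key in keys]

# Unpack the list of (number, insertion) tuples into two new columns
new_cols = pd.DataFrame(new_values, index=df.index,
                        columns=['new_residue_number', 'new_insertion'])
df = df.join(new_cols)
print(df)
```

As with the function above, a pair that is missing from my_new_dict raises a KeyError, which is probably what you want for a renumbering scheme that should cover every residue.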
I hope this can solve your problem!
Upvotes: 2
Reputation: 59579
The logic here is a simple merge, but we need to do a lot of work to turn that dictionary into a suitable DataFrame for mapping. I'd consider whether you can store the renumbering output in the form of my final DataFrame s from the start.
#Turn the dict into a mapping
s = pd.DataFrame(my_dict.values())[0].explode().to_frame()
s['idx'] = s.groupby(level=0).cumcount()
s = (s.pivot(columns='idx', values=0)
.rename_axis(None, axis=1)
.rename(columns={0: 'new_res', 1: 'new_ins'}))
s.index = pd.MultiIndex.from_tuples([*my_dict.keys()], names=['residue_number', 'insertion'])
s = s.reset_index().fillna('') # Because you have '' not NaN
#   residue_number insertion  new_res new_ins
#0               2                  1
#1               3                  2
#2               3         A        3
#3               5                  4       A
The mapping is now a merge. I'll leave all columns in for clarity of the logic, but you can use the commented out code to drop the original columns and rename the new columns.
df = df.merge(s, how='left')
# Your real output with
#df = (df.merge(s, how='left')
# .drop(columns=['residue_number', 'insertion'])
# .rename(columns={'new_res': 'residue_number',
# 'new_ins': 'insertion'}))
    ATOM  residue_number insertion  new_res new_ins
0      1               2                  1
1      2               2                  1
2      3               2                  1
3      4               3                  2
4      5               3                  2
5      6               3                  2
6      7               3         A        3
7      8               3         A        3
8      9               3         A        3
9     10               5                  4       A
10    11               5                  4       A
11    12               5                  4       A
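To illustrate the point about storing the renumbering output as a DataFrame from the start, here is a minimal sketch. It assumes the renumbering script can emit one record per (old number, old insertion, new number, new insertion); that record layout is my assumption, not something your script is known to produce. The validate and indicator arguments of merge are optional safety checks:

```python
import pandas as pd

df = pd.DataFrame({
    'ATOM': list(range(1, 13)),
    'residue_number': [2, 2, 2, 3, 3, 3, 3, 3, 3, 5, 5, 5],
    'insertion': ['', '', '', '', '', '', 'A', 'A', 'A', '', '', ''],
})

# Assumed shape of the renumbering output: one row per old (number, insertion) pair
s = pd.DataFrame(
    [(2, '', 1, ''), (3, '', 2, ''), (3, 'A', 3, ''), (5, '', 4, 'A')],
    columns=['residue_number', 'insertion', 'new_res', 'new_ins'],
)

out = df.merge(s, how='left', on=['residue_number', 'insertion'],
               validate='many_to_one',   # fail if s maps one old pair twice
               indicator=True)           # adds a _merge column

# Rows that found no entry in the mapping
unmapped = out[out['_merge'] == 'left_only']
print(out)
print(len(unmapped), 'unmapped rows')
```

With the mapping already in this shape, all of the explode/pivot work above goes away and the whole renumbering is the single merge call.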
Upvotes: 0