Drewaight

Reputation: 75

Mapping Tuple Dictionary to Multiple columns of a DataFrame

I've got a PDB DataFrame with residue insertion codes. Here is a simplified example:

import pandas as pd

d = {'ATOM' : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'residue_number' : [2, 2, 2, 3, 3, 3, 3, 3, 3, 5, 5, 5],
    'insertion' : ['', '', '', '', '', '', 'A', 'A', 'A', '', '', '']} 

df = pd.DataFrame(data = d)

Dataframe:

    ATOM  residue_number insertion
0      1               2          
1      2               2          
2      3               2          
3      4               3          
4      5               3          
5      6               3          
6      7               3     A
7      8               3     A
8      9               3     A
9     10               5          
10    11               5          
11    12               5          

I need to renumber the residues according to a different numbering and insertion scheme. Output from the renumbering script can be formatted into a dictionary of tuples, e.g.

my_dict = {(2,): 1, (3,): 2, (3, 'A') : 3, (5, ) : (4, 'A') }

Is it possible to map this dictionary of tuples onto the two columns ['residue_number', 'insertion']? The desired output would be:

    ATOM  residue_number insertion
0      1               1          
1      2               1          
2      3               1          
3      4               2          
4      5               2          
5      6               2          
6      7               3         
7      8               3         
8      9               3         
9     10               4      A          
10    11               4      A          
11    12               4      A          

I've been searching and banging my head on this for a few days. I've tried mapping and MultiIndex, but I can't seem to find a way to map a dictionary of tuples across multiple columns. I feel like I'm thinking about it wrong somehow. Thanks for any advice!

Upvotes: 4

Views: 2173

Answers (3)

ansev

Reputation: 30930

I think we can create a DataFrame from your dictionary after modifying it so that all values are tuples. Then we can use DataFrame.join. I think this is easier (and recommended) if we convert the blank values of the insertion column to NaN.

import numpy as np
new_df = ( df.assign(insertion = df['insertion'].replace(r'^\s*$',
                                                         np.nan,
                                                         regex=True)
                                                .mask(df['insertion'].isnull()))
              .join(pd.DataFrame({x:(y if isinstance(y,tuple) else (y,np.nan))
                                  for x,y in my_dict.items()},
                                  index = ['new_residue_number','new_insertion']).T,
                     on = ['residue_number','insertion'])
              .fillna('')
              .drop(['residue_number','insertion'],axis=1)
              .rename(columns = {'new_residue_number':'residue_number',
                                 'new_insertion':'insertion'}))
print(new_df)
    ATOM  residue_number insertion
0      1             1.0          
1      2             1.0          
2      3             1.0          
3      4             2.0          
4      5             2.0          
5      6             2.0          
6      7             3.0          
7      8             3.0          
8      9             3.0          
9     10             4.0         A
10    11             4.0         A
11    12             4.0         A

Detail

print(pd.DataFrame({x:(y if isinstance(y,tuple) else (y,np.nan))
                    for x,y in my_dict.items()},
                   index = ['new_residue_number','new_insertion']).T)
      new_residue_number new_insertion
2 NaN                  1           NaN
3 NaN                  2           NaN
  A                    3           NaN
5 NaN                  4             A

Upvotes: 1

Jorge Avila

Reputation: 171

In this case I think you need to define a function that takes your old residue_number and insertion as input and returns the new ones. I will work directly from the df, so, to avoid extra coding, I will redefine your my_dict so that every key and value is a two-element tuple, e.g. (2,) becomes (2, '').

Here is the code:

import pandas as pd
d = {'ATOM' : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], 
    'residue_number' : [2, 2, 2, 3, 3, 3, 3, 3, 3, 5, 5, 5],
    'insertion' : ['', '', '', '', '', '', 'A', 'A', 'A', '', '', '']} 

df = pd.DataFrame(data = d)

# Our new dict with keys and values as tuples
my_new_dict = {(2,''): (1,''), (3,''): (2,''), (3,'A'): (3,''), (5,''): (4,'A') }

# We need a function that maps a tuple (residue_number, insertion) into your new_residue_number and new_insertion values
def new_residue_number(residue_number, insertion, my_new_dict):
    # keys are tuples
    key = (residue_number, insertion)
    # Return new residue_number and insertion values
    return my_new_dict[key]

# Example to see how this works
print(new_residue_number(2, '', my_new_dict)) # Output (1,'')
print(new_residue_number(5, '', my_new_dict)) # Output (4, 'A')
print(new_residue_number(3, 'A', my_new_dict)) # Output (3,'')

# Now we apply this to our df and save it in the same df in two new columns
df[['new_residue_number','new_insertion']] = df.apply(lambda row: pd.Series(new_residue_number(row['residue_number'], row['insertion'], my_new_dict)), axis=1)

output here
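The same lookup can also be done without a row-wise apply, which tends to be slow on large structures. This is only a sketch under the same assumption as above (every key and value in my_new_dict is a two-element tuple); the intermediate names keys and mapped are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'ATOM': list(range(1, 13)),
    'residue_number': [2, 2, 2, 3, 3, 3, 3, 3, 3, 5, 5, 5],
    'insertion': ['', '', '', '', '', '', 'A', 'A', 'A', '', '', ''],
})
my_new_dict = {(2, ''): (1, ''), (3, ''): (2, ''), (3, 'A'): (3, ''), (5, ''): (4, 'A')}

# Pair up the two old columns row by row: (residue_number, insertion).
keys = zip(df['residue_number'], df['insertion'])

# Plain dictionary lookup per row; raises KeyError if a pair is not in the dict.
mapped = [my_new_dict[key] for key in keys]

# Split the list of (number, insertion) pairs back into two columns.
new_numbers, new_insertions = zip(*mapped)
df['new_residue_number'] = list(new_numbers)
df['new_insertion'] = list(new_insertions)
print(df)
```

The KeyError on an unknown pair is deliberate here: a residue that the renumbering script did not cover fails loudly instead of silently keeping its old number.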

I hope this can solve your problem!

Upvotes: 2

ALollz

Reputation: 59579

The logic here is a simple merge, but we need to do a lot of work to turn that dictionary into a suitable DataFrame for mapping. I'd consider whether you can store the renumbering output in the form of my final DataFrame s from the start.

#Turn the dict into a mapping
s = pd.DataFrame(my_dict.values())[0].explode().to_frame()
s['idx'] = s.groupby(level=0).cumcount()
s = (s.pivot(columns='idx', values=0)
     .rename_axis(None, axis=1) 
     .rename(columns={0: 'new_res', 1: 'new_ins'}))

s.index = pd.MultiIndex.from_tuples([*my_dict.keys()], names=['residue_number', 'insertion'])

s = s.reset_index().fillna('')  # Because you have '' not NaN
#   residue_number insertion  new_res new_ins
#0               2                  1        
#1               3                  2        
#2               3         A        3        
#3               5                  4       A

The mapping is now a merge. I'll leave all columns in to keep the logic clear, but you can use the commented-out code to drop the original columns and rename the new ones.

df = df.merge(s, how='left')

# Your real output, with the original columns dropped and the new ones renamed:
#df = (df.merge(s, how='left')
#        .drop(columns=['residue_number', 'insertion'])
#        .rename(columns={'new_res': 'residue_number',
#                         'new_ins': 'insertion'}))

    ATOM  residue_number insertion  new_res new_ins
0      1               2                  1        
1      2               2                  1        
2      3               2                  1        
3      4               3                  2        
4      5               3                  2        
5      6               3                  2        
6      7               3         A        3        
7      8               3         A        3        
8      9               3         A        3        
9     10               5                  4       A
10    11               5                  4       A
11    12               5                  4       A
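To illustrate the suggestion of storing the renumbering output in table form from the start, here is a minimal sketch. It assumes the renumbering script can emit one (old number, old insertion, new number, new insertion) record per residue; the list renumbering is a hypothetical stand-in for that output, not something from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'ATOM': list(range(1, 13)),
    'residue_number': [2, 2, 2, 3, 3, 3, 3, 3, 3, 5, 5, 5],
    'insertion': ['', '', '', '', '', '', 'A', 'A', 'A', '', '', ''],
})

# Hypothetical renumbering output, already one flat record per residue:
# (old residue_number, old insertion, new residue number, new insertion).
renumbering = [
    (2, '', 1, ''),
    (3, '', 2, ''),
    (3, 'A', 3, ''),
    (5, '', 4, 'A'),
]
s = pd.DataFrame(renumbering,
                 columns=['residue_number', 'insertion', 'new_res', 'new_ins'])

# validate raises if an old (residue_number, insertion) pair appears twice in s;
# indicator adds a '_merge' column showing which rows found a match.
merged = df.merge(s, how='left', on=['residue_number', 'insertion'],
                  validate='many_to_one', indicator=True)

# Atoms whose residue has no entry in the renumbering table.
unmapped = merged[merged['_merge'] == 'left_only']
print(merged)
print(len(unmapped), 'unmapped rows')
```

With the mapping kept in this shape, the explode/pivot step above is not needed, and the validate and indicator options give a cheap check that every residue was renumbered exactly once.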

Upvotes: 0
