user2110417

How to replace the column of dataframe based on priority order?

I have a dataframe with a column df["Annotations"] that looks like this:

missense_variant&splice_region_variant
stop_gained&splice_region_variant
splice_acceptor_variant&coding_sequence_variant&intron_variant
splice_donor_variant&splice_acceptor_variant&coding_sequence_variant&5_prime_UTR_variant&intron_variant
missense_variant&NMD_transcript_variant
frameshift_variant&splice_region_variant
splice_acceptor_variant&intron_variant
splice_acceptor_variant&coding_sequence_variant
stop_lost&3_prime_UTR_variant
missense_variant
splice_region_variant

I want to replace this column, or add a new one, keeping only the highest-priority term from each row. The priority is given as:

Type                 Rank
frameshift_variant      1
stop_gained             2
splice_region_variant   3
splice_acceptor_variant 4
splice_donor_variant    5
missense_variant        6
coding_sequence_variant 7

I want to either replace df['Annotations'] or add a new column df['Anno_prio'] that looks like this:

splice_region_variant
stop_gained
splice_acceptor_variant
splice_acceptor_variant
missense_variant
frameshift_variant
splice_acceptor_variant
splice_acceptor_variant
stop_lost
missense_variant
splice_region_variant

What I tried was replacing each combination one at a time:

df['Annotations']=df['Annotations'].str.replace('missense_variant&splice_region_variant','splice_region_variant')

Is there another way to do this with pandas?
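
For anyone who wants to reproduce this, the two inputs could be built roughly as below. This is only a sketch: the values are copied from the listings above, and the name df1 for the priority table is an assumption (the first answer calls it df1, the second df_rank).

```python
import pandas as pd

# Annotation strings copied from the question
df = pd.DataFrame({"Annotations": [
    "missense_variant&splice_region_variant",
    "stop_gained&splice_region_variant",
    "splice_acceptor_variant&coding_sequence_variant&intron_variant",
    "splice_donor_variant&splice_acceptor_variant&coding_sequence_variant"
    "&5_prime_UTR_variant&intron_variant",
    "missense_variant&NMD_transcript_variant",
    "frameshift_variant&splice_region_variant",
    "splice_acceptor_variant&intron_variant",
    "splice_acceptor_variant&coding_sequence_variant",
    "stop_lost&3_prime_UTR_variant",
    "missense_variant",
    "splice_region_variant",
]})

# Priority table: a lower Rank means a higher priority
df1 = pd.DataFrame({
    "Type": [
        "frameshift_variant",
        "stop_gained",
        "splice_region_variant",
        "splice_acceptor_variant",
        "splice_donor_variant",
        "missense_variant",
        "coding_sequence_variant",
    ],
    "Rank": [1, 2, 3, 4, 5, 6, 7],
})
```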

Upvotes: -1

Views: 114

Answers (2)

jezrael

Reputation: 862611

The idea is to split each value by & and build a dictionary that maps every term to its Rank, using dict.get with a default of one more than the maximal Rank for terms that are not in the priority table. Then return the key with the minimal value:

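# df1 is the priority table with columns Type and Rank
d = df1.set_index('Type')['Rank'].to_dict()
# default rank for terms missing from df1, lower priority than every listed term
max1 = df1['Rank'].max() + 1

def f(x):
    # map each '&'-separated term to its rank
    d1 = {y: d.get(y, max1) for y in x.split('&')}
    #https://stackoverflow.com/a/280156/2901002
    return min(d1, key=d1.get)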

df['Anno_prio'] = df['Annotations'].apply(f)
print (df)
                                          Annotations                Anno_prio
0              missense_variant&splice_region_variant    splice_region_variant
1                   stop_gained&splice_region_variant              stop_gained
2   splice_acceptor_variant&coding_sequence_varian...  splice_acceptor_variant
3   splice_donor_variant&splice_acceptor_variant&c...  splice_acceptor_variant
4             missense_variant&NMD_transcript_variant         missense_variant
5            frameshift_variant&splice_region_variant       frameshift_variant
6              splice_acceptor_variant&intron_variant  splice_acceptor_variant
7     splice_acceptor_variant&coding_sequence_variant  splice_acceptor_variant
8                       stop_lost&3_prime_UTR_variant                stop_lost
9                                    missense_variant         missense_variant
10                              splice_region_variant    splice_region_variant
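
To make the min(d1, key=d1.get) step a bit more concrete, here is a small pandas-free sketch. The names rank, fallback and top_priority are made up for illustration; the ranks are the ones from the question. It also shows why row 8 comes out as stop_lost: both of its terms get the same fallback rank, and min returns the first key on a tie.

```python
# Rank lookup copied from the question; a lower number means higher priority
rank = {
    "frameshift_variant": 1,
    "stop_gained": 2,
    "splice_region_variant": 3,
    "splice_acceptor_variant": 4,
    "splice_donor_variant": 5,
    "missense_variant": 6,
    "coding_sequence_variant": 7,
}
# Shared rank for terms that are not in the table (one past the worst rank)
fallback = max(rank.values()) + 1


def top_priority(annotation):
    """Return the '&'-separated term with the smallest rank."""
    scores = {term: rank.get(term, fallback) for term in annotation.split("&")}
    # Iterating a dict yields its keys; key=scores.get compares them by rank.
    # On a tie min() returns the first key, and dicts keep insertion order.
    return min(scores, key=scores.get)


print(top_priority("missense_variant&splice_region_variant"))  # splice_region_variant
print(top_priority("stop_lost&3_prime_UTR_variant"))           # stop_lost
```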

A pandas-only solution uses DataFrame.explode with DataFrame.sort_values, then removes the duplicated index values and sorts the index:

d = df1.set_index('Type')['Rank'].to_dict()

df = (df.assign(Anno_prio = df['Annotations'].str.split('&'))
        .explode('Anno_prio')
        .assign(new = lambda x: x['Anno_prio'].map(d))
        .sort_values('new')
        )
df = df[~df.index.duplicated()].sort_index()

print (df)
                                          Annotations  \
0              missense_variant&splice_region_variant   
1                   stop_gained&splice_region_variant   
2   splice_acceptor_variant&coding_sequence_varian...   
3   splice_donor_variant&splice_acceptor_variant&c...   
4             missense_variant&NMD_transcript_variant   
5            frameshift_variant&splice_region_variant   
6              splice_acceptor_variant&intron_variant   
7     splice_acceptor_variant&coding_sequence_variant   
8                       stop_lost&3_prime_UTR_variant   
9                                    missense_variant   
10                              splice_region_variant   

                  Anno_prio  new  
0     splice_region_variant  3.0  
1               stop_gained  2.0  
2   splice_acceptor_variant  4.0  
3   splice_acceptor_variant  4.0  
4          missense_variant  6.0  
5        frameshift_variant  1.0  
6   splice_acceptor_variant  4.0  
7   splice_acceptor_variant  4.0  
8                 stop_lost  NaN  
9          missense_variant  6.0  
10    splice_region_variant  3.0  
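
The part that may not be obvious is why df.index.duplicated() works here: explode repeats the original index label for every split term, so after sorting by rank the first occurrence of each label is the best term. Below is a minimal sketch on two rows. The names exploded and out are hypothetical, kind="mergesort" is my own addition to make the tie handling explicit, and the helper column is dropped at the end.

```python
import pandas as pd

df = pd.DataFrame({"Annotations": [
    "missense_variant&splice_region_variant",
    "stop_lost&3_prime_UTR_variant",
]})
# Only the two ranks needed for this small example
d = {"splice_region_variant": 3, "missense_variant": 6}

exploded = (df.assign(Anno_prio=df["Annotations"].str.split("&"))
              .explode("Anno_prio"))
# explode repeats the original index label once per list element
print(exploded.index.tolist())  # [0, 0, 1, 1]

exploded = (exploded.assign(new=lambda x: x["Anno_prio"].map(d))
                    # stable sort: rows that compare equal keep their original
                    # order; missing ranks go last (na_position default)
                    .sort_values("new", kind="mergesort"))

# keep the first (best ranked) row per original index label
out = exploded[~exploded.index.duplicated()].sort_index().drop(columns="new")
print(out["Anno_prio"].tolist())  # ['splice_region_variant', 'stop_lost']
```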

Upvotes: 0

Ferris

Reputation: 5601

process:

  1. Split by "&" and use pandas.Series.explode to transform each element of the list into its own row.
  2. Use Series.map to convert each Type to its Rank.
  3. Sort by Rank and drop_duplicates on the original index, keeping the first row.
  4. fillna with the first Type in Annotations for rows where no Type has a Rank.

# df_rank is the priority table with columns Type and Rank
anno_map = df_rank.set_index('Type')['Rank']
obj_anno_split = df['Annotations'].str.split('&')
df_anno_map = obj_anno_split.explode().reset_index()
# create a new rank column using map
df_anno_map['rank'] = df_anno_map['Annotations'].map(anno_map)

# keep the best-ranked row for every index by sorting and dropping duplicates
df_anno_map = (df_anno_map.dropna()
                  .sort_values('rank')
                  .drop_duplicates('index', keep='first')
                  .set_index('index')
                  .sort_index())

# assign Anno_prio, aligned on the index
df['Anno_prio'] = df_anno_map['Annotations']

# fill missing values with the first item of the split
df['Anno_prio'] = df['Anno_prio'].combine_first(obj_anno_split.str[0])

# print(df_anno_map)
# print(df)

result:

print(df_anno_map)

                  Annotations  rank
index                               
0        splice_region_variant   3.0
1                  stop_gained   2.0
2      splice_acceptor_variant   4.0
3      splice_acceptor_variant   4.0
4             missense_variant   6.0
5           frameshift_variant   1.0
6      splice_acceptor_variant   4.0
7      splice_acceptor_variant   4.0
9             missense_variant   6.0
10       splice_region_variant   3.0

print(df)
                                         Annotations                Anno_prio
0              missense_variant&splice_region_variant    splice_region_variant
1                   stop_gained&splice_region_variant              stop_gained
2   splice_acceptor_variant&coding_sequence_varian...  splice_acceptor_variant
3   splice_donor_variant&splice_acceptor_variant&c...  splice_acceptor_variant
4             missense_variant&NMD_transcript_variant         missense_variant
5            frameshift_variant&splice_region_variant       frameshift_variant
6              splice_acceptor_variant&intron_variant  splice_acceptor_variant
7     splice_acceptor_variant&coding_sequence_variant  splice_acceptor_variant
8                       stop_lost&3_prime_UTR_variant                stop_lost
9                                    missense_variant         missense_variant
10                              splice_region_variant    splice_region_variant
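
Step 4 is the reason row 8 still gets a value even though it is missing from df_anno_map. A tiny sketch of just that step follows; the Series names split, ranked and result are hypothetical, and ranked stands in for df_anno_map['Annotations'] with only row 0 present.

```python
import pandas as pd

split = pd.Series([
    "missense_variant&splice_region_variant",
    "stop_lost&3_prime_UTR_variant",
]).str.split("&")

# Pretend only row 0 had a ranked term; row 1 is absent, like row 8 above
ranked = pd.Series({0: "splice_region_variant"})

# combine_first keeps the values of `ranked` and fills the gaps
# (here index 1) from the other Series, aligning on the index
result = ranked.combine_first(split.str[0])
print(result.tolist())  # ['splice_region_variant', 'stop_lost']
```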

Upvotes: 1
