Reputation:
I have a DataFrame with a column df["Annotations"] as follows:
missense_variant&splice_region_variant
stop_gained&splice_region_variant
splice_acceptor_variant&coding_sequence_variant&intron_variant
splice_donor_variant&splice_acceptor_variant&coding_sequence_variant&5_prime_UTR_variant&intron_variant
missense_variant&NMD_transcript_variant
frameshift_variant&splice_region_variant
splice_acceptor_variant&intron_variant
splice_acceptor_variant&coding_sequence_variant
stop_lost&3_prime_UTR_variant
missense_variant
splice_region_variant
I want to replace this column, or add a new one, with the highest-priority term from each row. The priority is given as:
Type Rank
frameshift_variant 1
stop_gained 2
splice_region_variant 3
splice_acceptor_variant 4
splice_donor_variant 5
missense_variant 6
coding_sequence_variant 7
I want to replace df['Annotations'] or add a new column df['Anno_prio'] as:
splice_region_variant
stop_gained
splice_acceptor_variant
splice_acceptor_variant
missense_variant
frameshift_variant
splice_acceptor_variant
splice_acceptor_variant
stop_lost
missense_variant
splice_region_variant
What I tried was a separate replacement for each combination of terms:
df['Annotations'] = df['Annotations'].str.replace('missense_variant&splice_region_variant', 'splice_region_variant')
Is there any other approach to do it using pandas?
Upvotes: -1
Views: 114
Reputation: 862611
The idea is to split each value on '&' and, in a dictionary comprehension, look up each term's Rank with dict.get, using a default of one more than the maximal Rank for terms missing from the table. The key with the minimal value is then the highest-priority term:
d = df1.set_index('Type')['Rank'].to_dict()
max1 = df1['Rank'].max()+1
def f(x):
    # rank every term of the split string; unknown terms get max1
    d1 = {y: d.get(y, max1) for y in x.split('&')}
    # https://stackoverflow.com/a/280156/2901002
    return min(d1, key=d1.get)
df['Anno_prio'] = df['Annotations'].apply(f)
print (df)
Annotations Anno_prio
0 missense_variant&splice_region_variant splice_region_variant
1 stop_gained&splice_region_variant stop_gained
2 splice_acceptor_variant&coding_sequence_varian... splice_acceptor_variant
3 splice_donor_variant&splice_acceptor_variant&c... splice_acceptor_variant
4 missense_variant&NMD_transcript_variant missense_variant
5 frameshift_variant&splice_region_variant frameshift_variant
6 splice_acceptor_variant&intron_variant splice_acceptor_variant
7 splice_acceptor_variant&coding_sequence_variant splice_acceptor_variant
8 stop_lost&3_prime_UTR_variant stop_lost
9 missense_variant missense_variant
10 splice_region_variant splice_region_variant
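For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same dictionary idea. The layout of df1 (columns Type and Rank) is an assumption based on the table in the question, only a subset of the rows is used, and the helper name top_priority is hypothetical.

```python
import pandas as pd

# A subset of the rows from the question
df = pd.DataFrame({"Annotations": [
    "missense_variant&splice_region_variant",
    "stop_gained&splice_region_variant",
    "splice_acceptor_variant&coding_sequence_variant&intron_variant",
    "frameshift_variant&splice_region_variant",
    "stop_lost&3_prime_UTR_variant",
]})

# Assumed layout of the rank table: one row per Type with its Rank
df1 = pd.DataFrame({
    "Type": ["frameshift_variant", "stop_gained", "splice_region_variant",
             "splice_acceptor_variant", "splice_donor_variant",
             "missense_variant", "coding_sequence_variant"],
    "Rank": [1, 2, 3, 4, 5, 6, 7],
})

d = df1.set_index("Type")["Rank"].to_dict()
max1 = df1["Rank"].max() + 1  # default rank for terms missing from the table

def top_priority(annotation):
    # min() returns the first minimal item, so ties keep the original order
    return min(annotation.split("&"), key=lambda term: d.get(term, max1))

df["Anno_prio"] = df["Annotations"].apply(top_priority)
print(df["Anno_prio"].tolist())
```

Calling min directly on the split list avoids building an intermediate dictionary, and when no term is ranked (the last row) it falls back to the first term.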
A pandas-only solution uses DataFrame.explode with DataFrame.sort_values, then removes the duplicated index values and restores the original order by sorting the index:
d = df1.set_index('Type')['Rank'].to_dict()
df = (df.assign(Anno_prio = df['Annotations'].str.split('&'))
        .explode('Anno_prio')
        .assign(new = lambda x: x['Anno_prio'].map(d))
        .sort_values('new')
     )
df = df[~df.index.duplicated()].sort_index()
print (df)
Annotations \
0 missense_variant&splice_region_variant
1 stop_gained&splice_region_variant
2 splice_acceptor_variant&coding_sequence_varian...
3 splice_donor_variant&splice_acceptor_variant&c...
4 missense_variant&NMD_transcript_variant
5 frameshift_variant&splice_region_variant
6 splice_acceptor_variant&intron_variant
7 splice_acceptor_variant&coding_sequence_variant
8 stop_lost&3_prime_UTR_variant
9 missense_variant
10 splice_region_variant
Anno_prio new
0 splice_region_variant 3.0
1 stop_gained 2.0
2 splice_acceptor_variant 4.0
3 splice_acceptor_variant 4.0
4 missense_variant 6.0
5 frameshift_variant 1.0
6 splice_acceptor_variant 4.0
7 splice_acceptor_variant 4.0
8 stop_lost NaN
9 missense_variant 6.0
10 splice_region_variant 3.0
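In the output above, row 8 has NaN in the helper column because neither of its terms is ranked, so which term survives depends on how the sort orders ties. A possible variation, sketched under the assumption that unknown terms should rank just below the worst known rank, fills those gaps, uses a stable sort so ties keep their original order, and drops the helper column at the end. The small df and d here are illustrative samples, not the full data.

```python
import pandas as pd

df = pd.DataFrame({"Annotations": [
    "missense_variant&splice_region_variant",
    "splice_donor_variant&splice_acceptor_variant&coding_sequence_variant",
    "stop_lost&3_prime_UTR_variant",
    "missense_variant",
]})

# Rank lookup taken from the table in the question
d = {"frameshift_variant": 1, "stop_gained": 2, "splice_region_variant": 3,
     "splice_acceptor_variant": 4, "splice_donor_variant": 5,
     "missense_variant": 6, "coding_sequence_variant": 7}
fallback = max(d.values()) + 1  # assumed rank for unknown terms

out = (df.assign(Anno_prio=df["Annotations"].str.split("&"))
         .explode("Anno_prio")
         .assign(new=lambda x: x["Anno_prio"].map(d).fillna(fallback))
         .sort_values("new", kind="mergesort")  # stable: ties keep input order
      )
# keep the best-ranked term per original row, then restore row order
out = out[~out.index.duplicated()].sort_index().drop(columns="new")
print(out)
```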
Upvotes: 0
Reputation: 5601
process:
split Annotations on '&', use pandas.Series.explode to transform each element of a list-like to a row, then map Type to Rank and keep the best-ranked term for each row of Annotations.
anno_map = df_rank.set_index('Type')['Rank']
obj_anno_split = df['Annotations'].str.split('&')
df_anno_map = obj_anno_split.explode().reset_index()
# create a new column rank using map
df_anno_map['rank'] = df_anno_map['Annotations'].map(anno_map)
# keep the best rank for every index, by sorting and drop_duplicates
df_anno_map = (df_anno_map.dropna()
                          .sort_values('rank')
                          .drop_duplicates('index', keep='first')
                          .set_index('index')
                          .sort_index())
# assign Anno_prio, aligned on the index
df['Anno_prio'] = df_anno_map['Annotations']
# fill missing values with the first item of the split
df['Anno_prio'] = df['Anno_prio'].combine_first(obj_anno_split.str[0])
# print(df_anno_map)
# print(df)
result:
print(df_anno_map)
Annotations rank
index
0 splice_region_variant 3.0
1 stop_gained 2.0
2 splice_acceptor_variant 4.0
3 splice_acceptor_variant 4.0
4 missense_variant 6.0
5 frameshift_variant 1.0
6 splice_acceptor_variant 4.0
7 splice_acceptor_variant 4.0
9 missense_variant 6.0
10 splice_region_variant 3.0
print(df)
Annotations Anno_prio
0 missense_variant&splice_region_variant splice_region_variant
1 stop_gained&splice_region_variant stop_gained
2 splice_acceptor_variant&coding_sequence_varian... splice_acceptor_variant
3 splice_donor_variant&splice_acceptor_variant&c... splice_acceptor_variant
4 missense_variant&NMD_transcript_variant missense_variant
5 frameshift_variant&splice_region_variant frameshift_variant
6 splice_acceptor_variant&intron_variant splice_acceptor_variant
7 splice_acceptor_variant&coding_sequence_variant splice_acceptor_variant
8 stop_lost&3_prime_UTR_variant stop_lost
9 missense_variant missense_variant
10 splice_region_variant splice_region_variant
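The reason for the final combine_first step may be easier to see on a two-row sample. This is only a sketch: the reduced df and df_rank below are illustrative, and the names split, exploded, best and has_gap are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"Annotations": [
    "missense_variant&splice_region_variant",
    "stop_lost&3_prime_UTR_variant",   # neither term is in the rank table
]})
df_rank = pd.DataFrame({"Type": ["splice_region_variant", "missense_variant"],
                        "Rank": [3, 6]})

anno_map = df_rank.set_index("Type")["Rank"]
split = df["Annotations"].str.split("&")
exploded = split.explode().reset_index()
exploded["rank"] = exploded["Annotations"].map(anno_map)

# rows whose terms are all unranked disappear after dropna()
best = (exploded.dropna()
                .sort_values("rank")
                .drop_duplicates("index", keep="first")
                .set_index("index")["Annotations"])

df["Anno_prio"] = best                       # index alignment leaves row 1 empty
has_gap = df["Anno_prio"].isna().tolist()
# fall back to the first term of the split for the empty rows
df["Anno_prio"] = df["Anno_prio"].combine_first(split.str[0])
print(df)
```

Because dropna removes rows without any ranked term, the index-aligned assignment leaves them as NaN, and combine_first then fills them with the first term of the split.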
Upvotes: 1