Reputation:
I have a dataframe as follows:
import pandas as pd

data = {'CHROM': ['chr1', 'chr2', 'chr1', 'chr3', 'chr1'],
        'POS': [939570, 3411794, 1043223, 22511093, 24454031],
        'REF': ['T', 'T', 'CCT', 'CTT', 'CT'],
        'ALT': ['TCCCTGGAGGACC', 'C', 'C', 'CT', 'CTT'],
        'Len_REF': [1, 1, 3, 3, 2], 'Len_ALT': [13, 1, 1, 2, 3]
        }
df1 = pd.DataFrame(data)
It looks as follows:
  CHROM       POS  REF            ALT  Len_REF  Len_ALT
0  chr1    939570    T  TCCCTGGAGGACC        1       13
1  chr2   3411794    T              C        1        1
2  chr1   1043223  CCT              C        3        1
3  chr3  22511093  CTT             CT        3        2
4  chr1  24454031   CT            CTT        2        3
I want to add new columns to the dataframe, based on the existing column values, so that it looks as follows:
Positions            Allele        Combined
1:939570-939570      CCCTGGAGGACC  1:939570-939570:CCCTGGAGGACC
2:3411794-3411794    C             2:3411794-3411794:C
1:1043223-1043225    -             1:1043223-1043225:-
3:22511093-22511095  -             3:22511093-22511095:-
1:24454031-24454032  T             1:24454031-24454032:T
The df1['Positions'] values are generated from the values in CHROM and POS, with respect to the change between REF and ALT.
The df1['Allele'] values are made from REF and ALT.
Upvotes: 0
Views: 66
Reputation: 16683
Positions column: remove the non-numeric characters from the CHROM column with \D+ and str.replace, then build the rest of the string as desired.
Allele column: you can compare ALT and Len_REF row-wise and slice ALT by the Len_REF value dynamically. Make sure to pass axis=1:
df2 = df1.copy()  # assumed: df2 is a copy of the question's df1
# regex=True is needed on pandas >= 2.0, where str.replace matches literally by default
df2['Positions'] = (df2['CHROM'].str.replace(r'\D+', '', regex=True)
                    + ':' + df2['POS'].astype(str)
                    + '-' + (df2['POS'] + df2['Len_REF'] - 1).astype(str))
# Keep whatever follows the first Len_REF characters of ALT; empty results become '-'
df2['Allele'] = df2.apply(lambda x: x['ALT'][x['Len_REF']:], axis=1).replace('', '-')
df2['Combined'] = df2['Positions'] + ':' + df2['Allele']
df2.iloc[:, -3:]
Out[1]:
             Positions        Allele                      Combined
0      1:939570-939570  CCCTGGAGGACC  1:939570-939570:CCCTGGAGGACC
1    2:3411794-3411794             -           2:3411794-3411794:-
2    1:1043223-1043225             -           1:1043223-1043225:-
3  3:22511093-22511095             -         3:22511093-22511095:-
4  1:24454031-24454032             T         1:24454031-24454032:T
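Note that row 1 (T to C) comes out as - here, while the question's expected output has C for that row. If substitutions should keep ALT, one possible variant is sketched below. The three-way rule (insertion, deletion, substitution) is an assumption inferred from the expected output in the question, and the helper name allele and the name df3 are made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({
    'CHROM': ['chr1', 'chr2', 'chr1', 'chr3', 'chr1'],
    'POS': [939570, 3411794, 1043223, 22511093, 24454031],
    'REF': ['T', 'T', 'CCT', 'CTT', 'CT'],
    'ALT': ['TCCCTGGAGGACC', 'C', 'C', 'CT', 'CTT'],
})

def allele(row):
    # Hypothetical helper; the rule is an assumption, not a stated requirement.
    ref, alt = row['REF'], row['ALT']
    if alt.startswith(ref):   # insertion: keep only the bases added after REF
        return alt[len(ref):] or '-'
    if ref.startswith(alt):   # deletion: nothing new to report
        return '-'
    return alt                # substitution: report ALT unchanged

df3 = df1.copy()
# Same Positions logic as above, using the length of REF directly
df3['Positions'] = (df3['CHROM'].str.replace(r'\D+', '', regex=True)
                    + ':' + df3['POS'].astype(str)
                    + '-' + (df3['POS'] + df3['REF'].str.len() - 1).astype(str))
df3['Allele'] = df3.apply(allele, axis=1)
df3['Combined'] = df3['Positions'] + ':' + df3['Allele']
print(df3[['Positions', 'Allele', 'Combined']])
```

With the sample data this should give C for the chr2 row and leave the other rows the same as the output above. Variants where REF and ALT share only part of a prefix fall into the substitution branch, which may or may not be what you want.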
Upvotes: 0