user2110417

How to add new columns to the dataframe based on the column values?

I have a dataframe as follows:

import pandas as pd

data = {'CHROM':['chr1', 'chr2', 'chr1', 'chr3', 'chr1'],
        'POS':[939570,3411794,1043223,22511093,24454031],
        'REF':['T', 'T', 'CCT', 'CTT', 'CT'],
        'ALT':['TCCCTGGAGGACC', 'C', 'C', 'CT', 'CTT'],
        'Len_REF':[1,1,3,3,2], 'Len_ALT':[13,1,1,2,3]
       }
df1 = pd.DataFrame(data)

df1 looks as follows:

    CHROM   POS       REF  ALT            Len_REF  Len_ALT
0   chr1    939570    T    TCCCTGGAGGACC  1        13
1   chr2    3411794   T    C              1        1
2   chr1    1043223   CCT  C              3        1
3   chr3    22511093  CTT  CT             3        2
4   chr1    24454031  CT   CTT            2        3

I want to add new columns to the dataframe, derived from the existing column values, so that it looks as follows:

Positions             Allele         Combined
1:939570-939570       CCCTGGAGGACC   1:939570-939570:CCCTGGAGGACC
2:3411794-3411794     C              2:3411794-3411794:C
1:1043223-1043225     -              1:1043223-1043225:-
3:22511093-22511095   -              3:22511093-22511095:-
1:24454031-24454032   T              1:24454031-24454032:T

The values in df1['Positions'] are generated from CHROM and POS, taking the change between REF and ALT into account.

The values in df1['Allele'] are built from REF and ALT.

Upvotes: 0

Views: 66

Answers (1)

David Erickson

Reputation: 16683

  1. Positions column: remove the non-numerical characters from the CHROM column with \D+ and str.replace, then build the rest of the string as desired.
  2. Allele column: slice ALT row-wise, using the Len_REF value of each row as the start index, and replace empty results with '-'. Make sure to pass axis=1 to apply:

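df2 = df1.copy()  # work on a copy of the question's dataframe
# Chromosome number + ':' + start + '-' + end, where end = POS + Len_REF - 1
# (regex=True is required on pandas 2.0 and later, where str.replace is literal by default)
df2['Positions'] = (df2['CHROM'].str.replace(r'\D+', '', regex=True)
                     + ':' + df2['POS'].astype(str)
                     + '-' + (df2['POS'] + df2['Len_REF'] - 1).astype(str))
# Keep the part of ALT beyond the first Len_REF characters; '-' if nothing remains
df2['Allele'] = df2.apply(lambda x: x['ALT'][x['Len_REF']:], axis=1).replace('', '-')
df2['Combined'] = df2['Positions'] + ':' + df2['Allele']
df2.iloc[:, -3:]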

Out[1]: 
             Positions        Allele                      Combined
0      1:939570-939570  CCCTGGAGGACC  1:939570-939570:CCCTGGAGGACC
1    2:3411794-3411794             -           2:3411794-3411794:-
2    1:1043223-1043225             -           1:1043223-1043225:-
3  3:22511093-22511095             -         3:22511093-22511095:-
4  1:24454031-24454032             T         1:24454031-24454032:T
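
Note that row 1 comes out as `-` here, while the expected output in the question shows `C` for that row. The following is a minimal sketch of one way to reconcile the two, assuming the intended rule is that a single-base substitution (Len_REF and Len_ALT both equal to 1) keeps ALT unchanged. That rule is inferred from the example rows only, and the helper name `allele` is hypothetical.

```python
import pandas as pd

data = {'CHROM': ['chr1', 'chr2', 'chr1', 'chr3', 'chr1'],
        'POS': [939570, 3411794, 1043223, 22511093, 24454031],
        'REF': ['T', 'T', 'CCT', 'CTT', 'CT'],
        'ALT': ['TCCCTGGAGGACC', 'C', 'C', 'CT', 'CTT'],
        'Len_REF': [1, 1, 3, 3, 2], 'Len_ALT': [13, 1, 1, 2, 3]}
df = pd.DataFrame(data)

def allele(row):
    # Assumption: a single-base substitution keeps ALT as it is.
    if row['Len_REF'] == 1 and row['Len_ALT'] == 1:
        return row['ALT']
    # Otherwise keep what is left of ALT after the first Len_REF characters.
    tail = row['ALT'][row['Len_REF']:]
    return tail if tail else '-'

df['Positions'] = (df['CHROM'].str.replace(r'\D+', '', regex=True)
                   + ':' + df['POS'].astype(str)
                   + '-' + (df['POS'] + df['Len_REF'] - 1).astype(str))
df['Allele'] = df.apply(allele, axis=1)
df['Combined'] = df['Positions'] + ':' + df['Allele']
print(df.iloc[:, -3:])
```

With this rule, row 1 gets the allele `C` and the other rows are unchanged. If the real rule differs (for example for multi-base substitutions of equal length), the condition in `allele` would need to be adapted.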

Upvotes: 0
