Rachael

Reputation: 315

How can I make my pandas code more efficient?

I have the following pandas DataFrame:


    chr     POS     RS          REF ALT     
1   chr1    981931  rs2465128   A   GA  
2   chr1    982994  rs10267     T   C   
3   chr1    984302  rs9442391   T   C   
4   chr1    987200  rs9803031   C   T   
5   chr1    990280  rs4275402   CT  C   

I would like to add another column that has the value "SNP" if the values in both the "REF" and "ALT" columns have length 1, and the value "INDEL" if either of them has a different length, so the output should look like this:

    chr     POS     RS          REF ALT TYPE
1   chr1    981931  rs2465128   A   GA  INDEL
2   chr1    982994  rs10267     T   C   SNP
3   chr1    984302  rs9442391   T   C   SNP
4   chr1    987200  rs9803031   C   T   SNP
5   chr1    990280  rs4275402   CT  C   INDEL

I have written some code that works, but it is very slow. Is there a more efficient way to do this, for example with list comprehensions or lambda functions?

My code:

for index, row in table.iterrows():
    if len(row['REF']) == 1 and len(row['ALT']) == 1:
        table.loc[index, "TYPE"] = "SNP"
    else:
        table.loc[index, "TYPE"] = "INDEL"

Thanks a lot

Rachael

Upvotes: 1

Views: 60

Answers (1)

jezrael

Reputation: 862406

Use Series.str.len to get the string lengths of both columns, then set the new column with numpy.where:

import numpy as np

# True where both REF and ALT are a single character
m = (table['REF'].str.len() == 1) & (table['ALT'].str.len() == 1)

table["TYPE"] = np.where(m, "SNP", "INDEL")
print(table)
    chr     POS         RS REF ALT   TYPE
1  chr1  981931  rs2465128   A  GA  INDEL
2  chr1  982994    rs10267   T   C    SNP
3  chr1  984302  rs9442391   T   C    SNP
4  chr1  987200  rs9803031   C   T    SNP
5  chr1  990280  rs4275402  CT   C  INDEL
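
For reference, here is a self-contained sketch of the same idea that can be run as-is. It rebuilds the sample data from the question; the helper name `classify_variants` is hypothetical and only there for illustration. It also includes a list-comprehension version, since the question asked about one. That version gives the same labels for this data, and I would expect it to be much faster than `iterrows`, though I have not timed it here.

```python
import numpy as np
import pandas as pd


def classify_variants(table: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `table` with a TYPE column ("SNP" or "INDEL").

    Hypothetical helper; assumes string columns named REF and ALT.
    """
    out = table.copy()
    # Boolean mask: True where both alleles are exactly one character long.
    is_snp = (out["REF"].str.len() == 1) & (out["ALT"].str.len() == 1)
    # Vectorized selection: "SNP" where the mask is True, "INDEL" elsewhere.
    out["TYPE"] = np.where(is_snp, "SNP", "INDEL")
    return out


# Sample data copied from the question (index 1..5).
table = pd.DataFrame(
    {
        "chr": ["chr1"] * 5,
        "POS": [981931, 982994, 984302, 987200, 990280],
        "RS": ["rs2465128", "rs10267", "rs9442391", "rs9803031", "rs4275402"],
        "REF": ["A", "T", "T", "C", "CT"],
        "ALT": ["GA", "C", "C", "T", "C"],
    },
    index=range(1, 6),
)

result = classify_variants(table)

# List-comprehension alternative: plain Python over the two columns.
comprehension_types = [
    "SNP" if len(ref) == 1 and len(alt) == 1 else "INDEL"
    for ref, alt in zip(table["REF"], table["ALT"])
]
```

The helper works on a copy, so the original `table` is left unchanged; assign the result back if you want the column in place.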

Upvotes: 4
