Manolo Dominguez Becerra
Manolo Dominguez Becerra

Reputation: 1357

Creating new values in panda dataframe using math with existed columns

I have a df with numbers in the second column. Each number represents the length of a DNA sequence. I would like to create two new columns in which the first one says where this sequence start and the second one says where this sequence end.

This is my current df:

    Names  LEN
0     Ribosomal_S9:  121
1     Ribosomal_S8:  129
2    Ribosomal_L10:  100
3             GrpE:  166
4           DUF150:  141
..              ...  ...
115      TIGR03632:  117
116      TIGR03654:  175
117      TIGR03723:  314
118      TIGR03725:  212
119      TIGR03953:  188

[120 rows x 2 columns]

And this is what I am trying to get

              Names  LEN    Start   End
0     Ribosomal_S9:  121     0      121
1     Ribosomal_S8:  129    121     250
2    Ribosomal_L10:  100    250     350 
3             GrpE:  166    350     516
4           DUF150:  141    516     657 
..              ...  ...   ...  ..
115      TIGR03632:  117          
116      TIGR03654:  175          
117      TIGR03723:  314          
118      TIGR03725:  212          
119      TIGR03953:  188          

[120 rows x 4 columns]

Can please anyone put me in the right direction?

Upvotes: 1

Views: 33

Answers (1)

jezrael
jezrael

Reputation: 862731

Use DataFrame.assign with new columns created with Series.cumsum and for start is added Series.shift:

#convert column to integers
df['LEN'] = df['LEN'].astype(int)
#alternative for replace non numeric to missing values
#df['LEN'] = pd.to_numeric(df['LEN'], errors='coerce')

s = df['LEN'].cumsum()
df = df.assign(Start = s.shift(fill_value=0), End = s)
print (df)
            Names  LEN  Start  End
0   Ribosomal_S9:  121      0  121
1   Ribosomal_S8:  129    121  250
2  Ribosomal_L10:  100    250  350
3           GrpE:  166    350  516
4         DUF150:  141    516  657

Upvotes: 1

Related Questions