Reputation: 1357
I have a df with numbers in the second column. Each number represents the length of a DNA sequence. I would like to create two new columns in which the first one says where this sequence start and the second one says where this sequence end.
This is my current df:
Names LEN
0 Ribosomal_S9: 121
1 Ribosomal_S8: 129
2 Ribosomal_L10: 100
3 GrpE: 166
4 DUF150: 141
.. ... ...
115 TIGR03632: 117
116 TIGR03654: 175
117 TIGR03723: 314
118 TIGR03725: 212
119 TIGR03953: 188
[120 rows x 2 columns]
And this is what I am trying to get
Names LEN Start End
0 Ribosomal_S9: 121 0 121
1 Ribosomal_S8: 129 121 250
2 Ribosomal_L10: 100 250 350
3 GrpE: 166 350 516
4 DUF150: 141 516 657
.. ... ... ... ..
115 TIGR03632: 117
116 TIGR03654: 175
117 TIGR03723: 314
118 TIGR03725: 212
119 TIGR03953: 188
[120 rows x 4 columns]
Can please anyone put me in the right direction?
Upvotes: 1
Views: 33
Reputation: 862731
Use DataFrame.assign
with new columns created with Series.cumsum
and for start is added Series.shift
:
#convert column to integers
df['LEN'] = df['LEN'].astype(int)
#alternative for replace non numeric to missing values
#df['LEN'] = pd.to_numeric(df['LEN'], errors='coerce')
s = df['LEN'].cumsum()
df = df.assign(Start = s.shift(fill_value=0), End = s)
print (df)
Names LEN Start End
0 Ribosomal_S9: 121 0 121
1 Ribosomal_S8: 129 121 250
2 Ribosomal_L10: 100 250 350
3 GrpE: 166 350 516
4 DUF150: 141 516 657
Upvotes: 1