Reputation: 9437
I have a pandas Dataframe as
position base text
1 458372 A 19:t|12:cg|7:CG|1:tcag|1:T
2 458373 C 21:GCA|3:GCG|3:ATA|2:GCGAA|1:GTA|1:CGAG|1:g
I would like to retrieve the numbers from the text column and add them up in another column. Values in the text column contain numbers separated by any non-numerical values [^0-9]
. In the first row value of the text column, the numbers are 19, 12, 7, 1, and 1, which would add to 40, which would be a value in the new column. The resulting Dataframe would look like:
position base text text_sum
1 458372 A 19:t|12:cg|7:CG|1:tcag|1:T 40
2 458373 C 21:GCA|3:GCG|3:ATA|2:GCGAA|1:GTA|1:CGAG|1:g 32
Any clues as to how to approach this?
Upvotes: 2
Views: 231
Reputation: 28709
An alternative - list comprehension - because they are strings; I'd suggest testing the speed though as the size grows:
import re
df.assign(
text_sum=[sum(int(num)
for num in re.split("[:|]", val)
if num.isnumeric())
for val in df.text])
position base text text_sum
1 458372 A 19:t|12:cg|7:CG|1:tcag|1:T 40
2 458373 C 21:GCA|3:GCG|3:ATA|2:GCGAA|1:GTA|1:CGAG|1:g 32
Upvotes: 0
Reputation: 863301
Use Series.str.extractall
for get all numeric, convert to integers and last sum per duplicated index values:
df['text_sum'] = df['text'].str.extractall('(\d+)')[0].astype(int).sum(level=0)
print (df)
position base text text_sum
1 458372 A 19:t|12:cg|7:CG|1:tcag|1:T 40
2 458373 C 21:GCA|3:GCG|3:ATA|2:GCGAA|1:GTA|1:CGAG|1:g 32
Or if possible sum values splitted by |
and then before :
use:
df['text_sum'] = df['text'].apply(lambda x: sum(int(y.split(':')[0]) for y in x.split('|')))
print (df)
position base text text_sum
1 458372 A 19:t|12:cg|7:CG|1:tcag|1:T 40
2 458373 C 21:GCA|3:GCG|3:ATA|2:GCGAA|1:GTA|1:CGAG|1:g 32
Upvotes: 2