Julio Diaz
Julio Diaz

Reputation: 9437

Get numerical values from Dataframe column and add them to separate column

I have a pandas Dataframe as

    position base  text
1   458372   A    19:t|12:cg|7:CG|1:tcag|1:T
2   458373   C    21:GCA|3:GCG|3:ATA|2:GCGAA|1:GTA|1:CGAG|1:g

I would like to retrieve the numbers from the text column and add them up in another column. Values in the text column contain numbers separated by any non-numerical values [^0-9]. In the first row value of the text column, the numbers are 19, 12, 7, 1, and 1, which would add to 40, which would be a value in the new column. The resulting Dataframe would look like:

    position base  text                                          text_sum 
1   458372   A    19:t|12:cg|7:CG|1:tcag|1:T                    40
2   458373   C    21:GCA|3:GCG|3:ATA|2:GCGAA|1:GTA|1:CGAG|1:g   32

Any clues as to how to approach this?

Upvotes: 2

Views: 231

Answers (2)

sammywemmy
sammywemmy

Reputation: 28709

An alternative - list comprehension - because they are strings; I'd suggest testing the speed though as the size grows:

import re

df.assign(
    text_sum=[sum(int(num) 
              for num in re.split("[:|]", val)
              if num.isnumeric())
              for val in df.text])

    position  base     text                                    text_sum
1   458372     A    19:t|12:cg|7:CG|1:tcag|1:T                      40
2   458373     C    21:GCA|3:GCG|3:ATA|2:GCGAA|1:GTA|1:CGAG|1:g     32

Upvotes: 0

jezrael
jezrael

Reputation: 863301

Use Series.str.extractall for get all numeric, convert to integers and last sum per duplicated index values:

df['text_sum'] = df['text'].str.extractall('(\d+)')[0].astype(int).sum(level=0) 
print (df)

   position base                                         text  text_sum
1    458372    A                   19:t|12:cg|7:CG|1:tcag|1:T        40
2    458373    C  21:GCA|3:GCG|3:ATA|2:GCGAA|1:GTA|1:CGAG|1:g        32

Or if possible sum values splitted by | and then before : use:

df['text_sum'] = df['text'].apply(lambda x: sum(int(y.split(':')[0]) for y in x.split('|')))
print (df)
   position base                                         text  text_sum
1    458372    A                   19:t|12:cg|7:CG|1:tcag|1:T        40
2    458373    C  21:GCA|3:GCG|3:ATA|2:GCGAA|1:GTA|1:CGAG|1:g        32

Upvotes: 2

Related Questions