Lin Caijin
Lin Caijin

Reputation: 599

How to calculate the sum of the numbers in character column in R?

I have a dataset, a column of which was like this:

x$Intervar
[1] " Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] "
[2] " Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] "            
[3] " Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] "            
[4] " Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] "            
[5] " Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] "            
[6] " Likely benign PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 1, 0, 0, 1, 0] "         
[7] " Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] "            
[8] " Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] "  

Each row follows the same pattern:

Pathogenic/Likely pathogenic/Uncertain significance/Likely benign/Benign PVS1=0 PS=[0,0,0,0,0] PM=[0,0,0,0,0,0,0] PP=[0,0,0,0,0,0] BA1=0 BS=[0,0,0,0,0] BP=[0,0,0,0,0,0,0,0]

I want to split it into multiple columns and calculate the sum of the number in the bracket.

x
  PVS1   PS   PM   PP  BA1   BS   BP
1    0    0    1    0    0    0    0
2    1    0    1    1    0    0    0
3    1    0    1    1    0    0    0
4    1    0    1    1    0    0    0
5    1    0    1    1    0    0    0
6    0    0    1    0    0    0    2
7    1    0    1    1    0    0    0
8    1    0    1    1    0    0    0

Upvotes: 2

Views: 702

Answers (3)

G. Grothendieck
G. Grothendieck

Reputation: 269644

1) Convert the lines to dcf format by

  • placing a newline before each ...= and
  • place \nDescription= at the beginning of each line
  • in the [...] portions scan the numbers in and sum them using gsubfn and
  • replace the equal signs with a colon and space

Finally use read.dcf to create a character matrix, convert that to a data frame and use type.convert to convert the columns containing numbers to numeric and the first column to factor. If character rather than factor is preferred add the as.is=TRUE argument to type.convert.

If you don't want the first column add %>% .[-1] to the end of the pipeline.

library(gsubfn)
library(magrittr)

x$Intervar %>%
  gsub(" (\\w+=)", "\n\\1", .) %>%
  paste0("\nDescription=", .) %>%
  gsubfn("\\[(.*?)\\]", ~ sum(scan(text = txt, sep = ",", quiet = TRUE)), .) %>%
  gsub("=", ": ", .) %>%
  textConnection %>%
  read.dcf %>%
  as.data.frame %>%
  type.convert

giving this data frame:

             Description PVS1 PS PM PP BA1 BS BP
1 Uncertain significance    0  0  1  0   0  0  0
2             Pathogenic    1  0  1  1   0  0  0
3             Pathogenic    1  0  1  1   0  0  0
4             Pathogenic    1  0  1  1   0  0  0
5             Pathogenic    1  0  1  1   0  0  0
6          Likely benign    0  0  1  0   0  0  2
7             Pathogenic    1  0  1  1   0  0  0
8             Pathogenic    1  0  1  1   0  0  0

2) This is a base R approach. We remove everything before PSV1= and replace [ with space, ] with comma, comma with + and insert a comma after fields with no [...]. Then surround that with c(...) so at this point each line looks like:

"c( PVS1=0, PS= 0+ 0+ 0+ 0+ 0, PM= 0+ 1+ 0+ 0+ 0+ 0+ 0, PP= 0+ 0+ 0+ 0+ 0+ 0, BA1=0, BS= 0+ 0+ 0+ 0+ 0, BP= 0+ 0+ 0+ 0+ 0+ 0+ 0+ 0,  NULL)"

Now parse and eval each line creating a numeric vector out of each such string and rbind those together. We used the Bizarro pipe which only uses base R.

x$Intervar ->.;
  sub(".*PVS1=", "PVS1=", .) ->.;
  chartr("[],", " ,+", .) ->.;
  gsub("=(\\d)", "=\\1,", .) ->.;
  paste("c(", ., "NULL)") ->.;
  lapply(., function(x) eval(parse(text = x))) ->.;
  do.call("rbind", .)

giving this numeric matrix:

     PVS1 PS PM PP BA1 BS BP
[1,]    0  0  1  0   0  0  0
[2,]    1  0  1  1   0  0  0
[3,]    1  0  1  1   0  0  0
[4,]    1  0  1  1   0  0  0
[5,]    1  0  1  1   0  0  0
[6,]    0  0  1  0   0  0  2
[7,]    1  0  1  1   0  0  0
[8,]    1  0  1  1   0  0  0

Note

The input in reproducible form is:

x <- 
structure(list(Intervar = c(" Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] ", 
" Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] ", 
" Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] ", 
" Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] ", 
" Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] ", 
" Likely benign PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 1, 0, 0, 1, 0] ", 
" Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] ", 
" Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] "
)), class = "data.frame", row.names = c(NA, -8L))

Upvotes: 3

AnilGoyal
AnilGoyal

Reputation: 26218

Your strings are a lot messy, still you can do the following

cols = c("Col1", "PVS", "PS", "PM", "PP", "BA1", "BS", "BP")
df %>% separate(x, sep = '=', into = cols) %>%
  mutate(Col1 = str_remove(Col1, ' PVS1')) %>%
  mutate(PVS = str_remove_all(PVS, ' PS')) %>%
  mutate(PS = str_remove_all(PS, ' PM|\\[|\\]'),
         PS = str_split(PS, ","),
         PS = map_dbl(PS, ~sum(as.numeric(.x)))) %>%
  mutate(PM = str_remove_all(PM, ' PP|\\[|\\]'),
         PM = str_split(PM, ","),
         PM = map_dbl(PM, ~sum(as.numeric(.x)))) %>%
  mutate(PP = str_remove_all(PP, ' BA1|\\[|\\]'),
         PP = str_split(PP, ","),
         PP = map_dbl(PP, ~sum(as.numeric(.x)))) %>%
  mutate(BA1 = as.numeric(str_remove_all(BA1, ' BS'))) %>%
  mutate(BS = str_remove_all(BS, ' BP|\\[|\\]'),
         BS = str_split(BS, ","),
         BS = map_dbl(BS, ~sum(as.numeric(.x)))) %>%
  mutate(BP = str_remove_all(BP, '\\[|\\]'),
         BP = str_split(BP, ","),
         BP = map_dbl(BP, ~sum(as.numeric(.x))))

                     Col1 PVS PS PM PP BA1 BS BP
1  Uncertain significance   0  0  1  0   0  0  0
2              Pathogenic   1  0  1  1   0  0  0
3              Pathogenic   1  0  1  1   0  0  0
4              Pathogenic   1  0  1  1   0  0  0
5              Pathogenic   1  0  1  1   0  0  0
6           Likely benign   0  0  1  0   0  0  2
7              Pathogenic   1  0  1  1   0  0  0
8              Pathogenic   1  0  1  1   0  0  0

where df is

x <- c(" Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] ",
       " Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] " ,           
       " Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] "  ,          
       " Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] "  ,          
       " Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] "   ,         
       " Likely benign PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 1, 0, 0, 1, 0] " ,        
       " Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] "   ,         
       " Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] "  )

df <- data.frame(x = x)

Upvotes: 0

Chris Ruehlemann
Chris Ruehlemann

Reputation: 21400

Here's a stringr approach in steps:

library(stringr)
  • Step 1: Extract PVS1 values: This is the easiest step as there seems to be just one value each: first get rid of all the [and ]characters, then use lookbehind to locate the value in question:

    df$PVS1 <- str_extract(gsub("\\[|\\]", "", df$Intervar, perl = T), "(?<=PVS1=)\\d")
    
  • Step 2 (and 3, 4, 5, etc.): Extract PS values: This step, which is in principle repeated for all other values, is more complicated as we're dealing with a multi-character string:

  • step 2.1: extract the whole string while trimming whitespace on either end of the string:

    str1 <- trimws(str_extract(gsub("\\[|\\]", "", df$Intervar, perl = T), "(?<=PS=)[0-9, ]+"))
    
  • step 2.2: strsplit the extracted strings at ,and convert the pieces to numeric:

    str1_num <- lapply(strsplit(str1, ","), function(x) as.numeric(x))
    
  • step 2.3: now create the new column in dfwith the summed values:

    df$PS <- lapply(str1_num, sum)
    

Result so far (after Step 2):

df
                                                                                                                                                Intervar
1  Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] 
2              Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] 
  PVS1 PS
1    0  0
2    1  0

For the other values, just adapt the code appropriately!

Data:

df <- data.frame(
  Intervar = c(" Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] ",
               " Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] ") 
)

Upvotes: 1

Related Questions