Reputation: 599
I have a dataset, a column of which was like this:
x$Intervar
[1] " Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] "
[2] " Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] "
[3] " Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] "
[4] " Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] "
[5] " Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] "
[6] " Likely benign PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 1, 0, 0, 1, 0] "
[7] " Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] "
[8] " Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] "
Each row follows the same pattern:
Pathogenic/Likely pathogenic/Uncertain significance/Likely benign/Benign PVS1=0 PS=[0,0,0,0,0] PM=[0,0,0,0,0,0,0] PP=[0,0,0,0,0,0] BA1=0 BS=[0,0,0,0,0] BP=[0,0,0,0,0,0,0,0]
I want to split it into multiple columns and calculate the sum of the number in the bracket.
x
PVS1 PS PM PP BA1 BS BP
1 0 0 1 0 0 0 0
2 1 0 1 1 0 0 0
3 1 0 1 1 0 0 0
4 1 0 1 1 0 0 0
5 1 0 1 1 0 0 0
6 0 0 1 0 0 0 2
7 1 0 1 1 0 0 0
8 1 0 1 1 0 0 0
Upvotes: 2
Views: 702
Reputation: 269644
1) Convert the lines to dcf format by
gsubfn
andFinally use read.dcf
to create a character matrix, convert that to a data frame and use type.convert
to convert the columns containing numbers to numeric and the first column to factor. If character rather than factor is preferred add the as.is=TRUE
argument to type.convert
.
If you don't want the first column add %>% .[-1]
to the end of the pipeline.
library(gsubfn)
library(magrittr)
x$Intervar %>%
gsub(" (\\w+=)", "\n\\1", .) %>%
paste0("\nDescription=", .) %>%
gsubfn("\\[(.*?)\\]", ~ sum(scan(text = txt, sep = ",", quiet = TRUE)), .) %>%
gsub("=", ": ", .) %>%
textConnection %>%
read.dcf %>%
as.data.frame %>%
type.convert
giving this data frame:
Description PVS1 PS PM PP BA1 BS BP
1 Uncertain significance 0 0 1 0 0 0 0
2 Pathogenic 1 0 1 1 0 0 0
3 Pathogenic 1 0 1 1 0 0 0
4 Pathogenic 1 0 1 1 0 0 0
5 Pathogenic 1 0 1 1 0 0 0
6 Likely benign 0 0 1 0 0 0 2
7 Pathogenic 1 0 1 1 0 0 0
8 Pathogenic 1 0 1 1 0 0 0
2) This is a base R approach. We remove everything before PSV1= and replace [ with space, ] with comma, comma with + and insert a comma after fields with no [...]. Then surround that with c(...) so at this point each line looks like:
"c( PVS1=0, PS= 0+ 0+ 0+ 0+ 0, PM= 0+ 1+ 0+ 0+ 0+ 0+ 0, PP= 0+ 0+ 0+ 0+ 0+ 0, BA1=0, BS= 0+ 0+ 0+ 0+ 0, BP= 0+ 0+ 0+ 0+ 0+ 0+ 0+ 0, NULL)"
Now parse and eval each line creating a numeric vector out of each such string and rbind those together. We used the Bizarro pipe which only uses base R.
x$Intervar ->.;
sub(".*PVS1=", "PVS1=", .) ->.;
chartr("[],", " ,+", .) ->.;
gsub("=(\\d)", "=\\1,", .) ->.;
paste("c(", ., "NULL)") ->.;
lapply(., function(x) eval(parse(text = x))) ->.;
do.call("rbind", .)
giving this numeric matrix:
PVS1 PS PM PP BA1 BS BP
[1,] 0 0 1 0 0 0 0
[2,] 1 0 1 1 0 0 0
[3,] 1 0 1 1 0 0 0
[4,] 1 0 1 1 0 0 0
[5,] 1 0 1 1 0 0 0
[6,] 0 0 1 0 0 0 2
[7,] 1 0 1 1 0 0 0
[8,] 1 0 1 1 0 0 0
The input in reproducible form is:
x <-
structure(list(Intervar = c(" Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] ",
" Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] ",
" Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] ",
" Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] ",
" Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] ",
" Likely benign PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 1, 0, 0, 1, 0] ",
" Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] ",
" Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] "
)), class = "data.frame", row.names = c(NA, -8L))
Upvotes: 3
Reputation: 26218
Your strings are a lot messy, still you can do the following
cols = c("Col1", "PVS", "PS", "PM", "PP", "BA1", "BS", "BP")
df %>% separate(x, sep = '=', into = cols) %>%
mutate(Col1 = str_remove(Col1, ' PVS1')) %>%
mutate(PVS = str_remove_all(PVS, ' PS')) %>%
mutate(PS = str_remove_all(PS, ' PM|\\[|\\]'),
PS = str_split(PS, ","),
PS = map_dbl(PS, ~sum(as.numeric(.x)))) %>%
mutate(PM = str_remove_all(PM, ' PP|\\[|\\]'),
PM = str_split(PM, ","),
PM = map_dbl(PM, ~sum(as.numeric(.x)))) %>%
mutate(PP = str_remove_all(PP, ' BA1|\\[|\\]'),
PP = str_split(PP, ","),
PP = map_dbl(PP, ~sum(as.numeric(.x)))) %>%
mutate(BA1 = as.numeric(str_remove_all(BA1, ' BS'))) %>%
mutate(BS = str_remove_all(BS, ' BP|\\[|\\]'),
BS = str_split(BS, ","),
BS = map_dbl(BS, ~sum(as.numeric(.x)))) %>%
mutate(BP = str_remove_all(BP, '\\[|\\]'),
BP = str_split(BP, ","),
BP = map_dbl(BP, ~sum(as.numeric(.x))))
Col1 PVS PS PM PP BA1 BS BP
1 Uncertain significance 0 0 1 0 0 0 0
2 Pathogenic 1 0 1 1 0 0 0
3 Pathogenic 1 0 1 1 0 0 0
4 Pathogenic 1 0 1 1 0 0 0
5 Pathogenic 1 0 1 1 0 0 0
6 Likely benign 0 0 1 0 0 0 2
7 Pathogenic 1 0 1 1 0 0 0
8 Pathogenic 1 0 1 1 0 0 0
where df
is
x <- c(" Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] ",
" Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] " ,
" Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] " ,
" Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] " ,
" Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] " ,
" Likely benign PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 1, 0, 0, 1, 0] " ,
" Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] " ,
" Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] " )
df <- data.frame(x = x)
Upvotes: 0
Reputation: 21400
Here's a stringr
approach in steps:
library(stringr)
Step 1: Extract PVS1
values:
This is the easiest step as there seems to be just one value each: first get rid of all the [
and ]
characters, then use lookbehind to locate the value in question:
df$PVS1 <- str_extract(gsub("\\[|\\]", "", df$Intervar, perl = T), "(?<=PVS1=)\\d")
Step 2 (and 3, 4, 5, etc.): Extract PS
values:
This step, which is in principle repeated for all other values, is more complicated as we're dealing with a multi-character string:
step 2.1: extract the whole string while trimming whitespace on either end of the string:
str1 <- trimws(str_extract(gsub("\\[|\\]", "", df$Intervar, perl = T), "(?<=PS=)[0-9, ]+"))
step 2.2: strsplit
the extracted strings at ,
and convert the pieces to numeric:
str1_num <- lapply(strsplit(str1, ","), function(x) as.numeric(x))
step 2.3: now create the new column in df
with the summed values:
df$PS <- lapply(str1_num, sum)
Result so far (after Step 2):
df
Intervar
1 Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0]
2 Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0]
PVS1 PS
1 0 0
2 1 0
For the other values, just adapt the code appropriately!
Data:
df <- data.frame(
Intervar = c(" Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] ",
" Pathogenic PVS1=1 PS=[0, 0, 0, 0, 0] PM=[0, 1, 0, 0, 0, 0, 0] PP=[0, 0, 1, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0] ")
)
Upvotes: 1