knb
knb

Reputation: 9295

How to perform a separate() followed by a mutate_each() with dplyr

I have data in an sqlite database that contains an entity that is not in first normal form. The strings in column 'sample_attribute' look like this:

 isolate: R4166 || age: 43.88 || biomaterial_provider: LIBD || sex: male || tissue: DLPFC || disease: control || race: AA || RIN: 8.7 || Fraction: total || BioSampleModel: Human

My code at this time:

library(tidyr)
library(dplyr)
library(stringi)



rs.df <- structure(list(run_accession = c("SRR1554537", "SRR2071348"), 
platform_parameters = c("INSTRUMENT_MODEL: Illumina HiSeq 2000", 
"INSTRUMENT_MODEL: Illumina HiSeq 2000"), sample_attribute = c("isolate: R3452 || age: -0.3836 || biomaterial_provider: LIBD || sex: female || tissue: DLPFC || disease: control || race: AA || RIN: 9.6 || Fraction: total || BioSampleModel: Human", "isolate: R3452 || age: -0.3836 || biomaterial_provider: LIBD || sex: female || tissue: DLPFC || disease: control || race: AA || RIN: 9.6 || Fraction: total || BioSampleModel: Human")), .Names = c("run_accession", "platform_parameters", "sample_attribute"
), row.names = c(NA, -2L), class = "data.frame")

coln <- c("isolate", "age", "biomaterial_provider", "sex", "tissue", "disease", "race",
          "RIN", "Fraction", "BioSampleModel")

rs.df <- rs.df %>%
        separate(sample_attribute, coln, sep = "\\|\\|")

head(rs.df, 1)

Intermediate Result:

       sample_attribute
  run_accession                   platform_parameters         isolate          age
1    SRR1554534 INSTRUMENT_MODEL: Illumina HiSeq 2000 isolate: DLPFC   age: 40.42 
          biomaterial_provider         sex          tissue            disease
1  biomaterial_provider: LIBD   sex: male   tissue: DLPFC   disease: Control 
        race        RIN          Fraction         BioSampleModel
1  race: AA   RIN: 8.4   Fraction: total   BioSampleModel: Human

Currently I continue with

for (x in coln){
        rs.df[,x] <- stri_replace(rs.df[,x], regex = "^.+:\\s*", replacement = "")
}

but that's inflexible.

Is there a way to extend the dplyr pipeline such that the for-loop is being replaced (as far as possible) with calls in the %>% pipeline?

At least, for values of the columns in coln, remove the strings until the colon from the result of the separate() call :

rs.df <- rs.df %>%
        separate(sample_attribute, coln, sep = "\\|\\|") %>%
        mutate_each(... stri_replace...) #split pairs at ":", remove part before ":"

(Here the for-loop has solved my problem of separating/cleaning up the strings. However, there are probably more such columns in the SRAdb database with key:valuepairs separated by "||". How to process them in a more flexible way?)

Upvotes: 0

Views: 127

Answers (1)

chinsoon12
chinsoon12

Reputation: 25225

check out the answer by @docendo discimus here: dplyr certain columns

in your case

rs.df <- rs.df %>%
    separate(sample_attribute, coln, sep = "\\|\\|") %>%
    mutate_each_(funs(stri_replace(., regex="^.+:\\s*", replacement="")), coln)

Upvotes: 1

Related Questions