codeby
codeby

Reputation: 33

Extracting parts of a row name to make a new column in a data frame in R

I have a data frame in R called cryptdeltact that contains sample information as follows

# A tibble: 2,293 x 7
# Groups:   Name [72]
   Name          Detector      N   Value    sd    se    ci
   <fct>         <fct>     <dbl>   <dbl> <dbl> <dbl> <dbl>
 1 VG 2H 1 SB1 C ATM           6 11.4    0.653 0.267 0.686
 2 VG 2H 1 SB1 C BetaActin     6  0.0199 0.588 0.240 0.617
 3 VG 2H 1 SB1 C BMPR1a        6  6.49   0.591 0.241 0.620
 4 VG 2H 1 SB1 C BMPR2         6  7.19   0.614 0.251 0.645
 5 VG 2H 1 SB1 C Brca1         6 11.5    0.640 0.261 0.672
 6 VG 2H 1 SB1 C Brca2         6 11.9    0.840 0.343 0.882
 7 VG 2H 1 SB1 C cmyc          6  8.20   0.580 0.237 0.608
 8 VG 2H 1 SB1 C DNAPKCs       6 11.5    0.651 0.266 0.683
 9 VG 2H 1 SB1 C Ercc1         6 11.4    0.783 0.320 0.822
10 VG 2H 1 SB1 C Fen1          6  9.05   0.629 0.257 0.660
# … with 2,283 more rows

I want to add three new columns to this data frame: Model, Time and Region. All of the info for these new columns is contained somewhere in the existing 'Name' column. Time is the second piece of information in 'Name' ie. '0h', '2h' or '5h'. Region is the second last ie 'SB1', 'SB2', 'SB3' or 'SB4'. But Model is a combination of the first two letters and the last letter ie. 'VG C' or 'VG V' or 'WT C' or 'WT V'. I know the answer lies in extracting the appropriate info from the Name string and putting it into a new column but I am struggling with the syntax.

The final table columns would ideally look like this (once extracted I can change 'VG V' to 'VG Villus' and remove the Name column entirely)

   Model      Time Region Detector     N     sd     se    ci
   <chr>     <dbl> <chr>  <chr>    <dbl>  <dbl>  <dbl> <dbl>
 1 VG Villus     0 SB1    Fen1         1 NA     NA     NA   
 2 VG Villus     0 SB1    Lig3         1 NA     NA     NA   
 3 VG Villus     0 SB1    PARP1        1 NA     NA     NA   
 4 VG Villus     0 SB1    PolTheta     1 NA     NA     NA   
 5 VG Villus     0 SB1    WRN          1 NA     NA     NA   
 6 VG Villus     2 SB1    Fen1         3  1.22   0.706  3.04
 7 VG Villus     2 SB1    Lig3         3  2.11   1.22   5.25
 8 VG Villus     2 SB1    Mre11a       3  0.601  0.347  1.49
 9 VG Villus     2 SB1    PARP1        3  1.94   1.12   4.82
10 VG Villus     2 SB1    PolTheta     3  2.74   1.58   6.82

Apologies for the base question but I'm sure this could take up far less time than it currently is!

Upvotes: 2

Views: 1040

Answers (2)

arranjdavis
arranjdavis

Reputation: 735

This should work, and it just uses base R. Also, I am going to give you a bonus, and provide you with a numerical time variable (I think this is what you want?).

(assuming your data frame is called data)

#string split to create a list of all names
split_col = strsplit(as.character(data$Name), " ")

#create the lists for each new variable
time_var = c()
region_var = c()
model_var = c()

#create a counter for the for loop
i = 1

#go through all the name strings
for (s in split_col){

  #add to the lists
  time_var[[i]] = s[2]
  region_var[[i]] = s[4]
  model_var[[i]] = paste(s[1], s[5])

  #add to the counter
  i = i + 1
}

#add these lists to the dataset
data$model = model_var
data$region = region_var
data$time = time_var

#make the time variable numeric
data$time_numeric = ifelse(data$time == '2H', 2, ifelse(data$time == '5H', 5, ifelse(data$time == '0H', 0, NA)))

Hope that works!

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 388982

We can use tidyr extract with appropriate regex and then unite the column

library(tidyr)

extract(df, Name, into = c("Model", "Time", "Region", "temp"), 
           regex = "(.*)(\\d)H.*(SB\\d).*([A-Z])$") %>%
unite(Model, Model, temp, sep = "")

#   Model Time Region  Detector N   Value    sd    se    ci
#1   VG C    2    SB1       ATM 6 11.4000 0.653 0.267 0.686
#2   VG C    2    SB1 BetaActin 6  0.0199 0.588 0.240 0.617
#3   VG C    2    SB1    BMPR1a 6  6.4900 0.591 0.241 0.620
#4   VG C    2    SB1     BMPR2 6  7.1900 0.614 0.251 0.645
#5   VG C    2    SB1     Brca1 6 11.5000 0.640 0.261 0.672
#6   VG C    2    SB1     Brca2 6 11.9000 0.840 0.343 0.882
#7   VG C    2    SB1      cmyc 6  8.2000 0.580 0.237 0.608
#8   VG C    2    SB1   DNAPKCs 6 11.5000 0.651 0.266 0.683
#9   VG C    2    SB1     Ercc1 6 11.4000 0.783 0.320 0.822
#10  VG C    2    SB1      Fen1 6  9.0500 0.629 0.257 0.660

data

df <- structure(list(Name = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), .Label = "VG 2H 1 SB1 C", class = "factor"), Detector = 
structure(1:10, .Label = c("ATM", "BetaActin", "BMPR1a", "BMPR2", "Brca1", "Brca2", 
"cmyc", "DNAPKCs", "Ercc1", "Fen1"), class = "factor"), N = c(6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L), Value = c(11.4, 0.0199, 6.49, 7.19, 11.5, 
11.9, 8.2, 11.5, 11.4, 9.05), sd = c(0.653, 0.588, 0.591, 0.614, 
0.64, 0.84, 0.58, 0.651, 0.783, 0.629), se = c(0.267, 0.24, 0.241, 
0.251, 0.261, 0.343, 0.237, 0.266, 0.32, 0.257), ci = c(0.686, 
0.617, 0.62, 0.645, 0.672, 0.882, 0.608, 0.683, 0.822, 0.66)), 
class = "data.frame", row.names = c(NA, -10L))

Upvotes: 1

Related Questions