Reputation: 33
I have a data frame in R called cryptdeltact that contains sample information as follows
# A tibble: 2,293 x 7
# Groups: Name [72]
Name Detector N Value sd se ci
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 VG 2H 1 SB1 C ATM 6 11.4 0.653 0.267 0.686
2 VG 2H 1 SB1 C BetaActin 6 0.0199 0.588 0.240 0.617
3 VG 2H 1 SB1 C BMPR1a 6 6.49 0.591 0.241 0.620
4 VG 2H 1 SB1 C BMPR2 6 7.19 0.614 0.251 0.645
5 VG 2H 1 SB1 C Brca1 6 11.5 0.640 0.261 0.672
6 VG 2H 1 SB1 C Brca2 6 11.9 0.840 0.343 0.882
7 VG 2H 1 SB1 C cmyc 6 8.20 0.580 0.237 0.608
8 VG 2H 1 SB1 C DNAPKCs 6 11.5 0.651 0.266 0.683
9 VG 2H 1 SB1 C Ercc1 6 11.4 0.783 0.320 0.822
10 VG 2H 1 SB1 C Fen1 6 9.05 0.629 0.257 0.660
# … with 2,283 more rows
I want to add three new columns to this data frame: Model, Time and Region. All of the info for these new columns is contained somewhere in the existing 'Name' column. Time is the second piece of information in 'Name' ie. '0h', '2h' or '5h'. Region is the second last ie 'SB1', 'SB2', 'SB3' or 'SB4'. But Model is a combination of the first two letters and the last letter ie. 'VG C' or 'VG V' or 'WT C' or 'WT V'. I know the answer lies in extracting the appropriate info from the Name string and putting it into a new column but I am struggling with the syntax.
The final table columns would ideally look like this (once extracted I can change 'VG V' to 'VG Villus' and remove the Name column entirely)
Model Time Region Detector N sd se ci
<chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 VG Villus 0 SB1 Fen1 1 NA NA NA
2 VG Villus 0 SB1 Lig3 1 NA NA NA
3 VG Villus 0 SB1 PARP1 1 NA NA NA
4 VG Villus 0 SB1 PolTheta 1 NA NA NA
5 VG Villus 0 SB1 WRN 1 NA NA NA
6 VG Villus 2 SB1 Fen1 3 1.22 0.706 3.04
7 VG Villus 2 SB1 Lig3 3 2.11 1.22 5.25
8 VG Villus 2 SB1 Mre11a 3 0.601 0.347 1.49
9 VG Villus 2 SB1 PARP1 3 1.94 1.12 4.82
10 VG Villus 2 SB1 PolTheta 3 2.74 1.58 6.82
Apologies for the base question but I'm sure this could take up far less time than it currently is!
Upvotes: 2
Views: 1040
Reputation: 735
This should work, and it just uses base R. Also, I am going to give you a bonus, and provide you with a numerical time variable (I think this is what you want?).
(assuming your data frame is called data
)
#string split to create a list of all names
split_col = strsplit(as.character(data$Name), " ")
#create the lists for each new variable
time_var = c()
region_var = c()
model_var = c()
#create a counter for the for loop
i = 1
#go through all the name strings
for (s in split_col){
#add to the lists
time_var[[i]] = s[2]
region_var[[i]] = s[4]
model_var[[i]] = paste(s[1], s[5])
#add to the counter
i = i + 1
}
#add these lists to the dataset
data$model = model_var
data$region = region_var
data$time = time_var
#make the time variable numeric
data$time_numeric = ifelse(data$time == '2H', 2, ifelse(data$time == '5H', 5, ifelse(data$time == '0H', 0, NA)))
Hope that works!
Upvotes: 1
Reputation: 388982
We can use tidyr
extract
with appropriate regex
and then unite
the column
library(tidyr)
extract(df, Name, into = c("Model", "Time", "Region", "temp"),
regex = "(.*)(\\d)H.*(SB\\d).*([A-Z])$") %>%
unite(Model, Model, temp, sep = "")
# Model Time Region Detector N Value sd se ci
#1 VG C 2 SB1 ATM 6 11.4000 0.653 0.267 0.686
#2 VG C 2 SB1 BetaActin 6 0.0199 0.588 0.240 0.617
#3 VG C 2 SB1 BMPR1a 6 6.4900 0.591 0.241 0.620
#4 VG C 2 SB1 BMPR2 6 7.1900 0.614 0.251 0.645
#5 VG C 2 SB1 Brca1 6 11.5000 0.640 0.261 0.672
#6 VG C 2 SB1 Brca2 6 11.9000 0.840 0.343 0.882
#7 VG C 2 SB1 cmyc 6 8.2000 0.580 0.237 0.608
#8 VG C 2 SB1 DNAPKCs 6 11.5000 0.651 0.266 0.683
#9 VG C 2 SB1 Ercc1 6 11.4000 0.783 0.320 0.822
#10 VG C 2 SB1 Fen1 6 9.0500 0.629 0.257 0.660
data
df <- structure(list(Name = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = "VG 2H 1 SB1 C", class = "factor"), Detector =
structure(1:10, .Label = c("ATM", "BetaActin", "BMPR1a", "BMPR2", "Brca1", "Brca2",
"cmyc", "DNAPKCs", "Ercc1", "Fen1"), class = "factor"), N = c(6L, 6L, 6L, 6L, 6L,
6L, 6L, 6L, 6L, 6L), Value = c(11.4, 0.0199, 6.49, 7.19, 11.5,
11.9, 8.2, 11.5, 11.4, 9.05), sd = c(0.653, 0.588, 0.591, 0.614,
0.64, 0.84, 0.58, 0.651, 0.783, 0.629), se = c(0.267, 0.24, 0.241,
0.251, 0.261, 0.343, 0.237, 0.266, 0.32, 0.257), ci = c(0.686,
0.617, 0.62, 0.645, 0.672, 0.882, 0.608, 0.683, 0.822, 0.66)),
class = "data.frame", row.names = c(NA, -10L))
Upvotes: 1