Reputation: 7774
I spent about 20 minutes looking through previous questions but could not find what I am looking for. I have a large data frame that I want to subset based on a list of names, but the names in the data frame can also carry a postfix that is not in the list.
In other words, is there a simpler, generic way (one that works for an arbitrary number of postfixes) to do the following:
data <- data.frame("name"=c("name1","name1_post1","name2","name2_post1",
"name2_post2","name3","name4"),
"data"=rnorm(7,0,1),
stringsAsFactors=FALSE)
names <- c("name2","name3")
subset <- data[ data$name %in% names | data$name %in% paste0(names,"_post1") | data$name %in% paste0(names,"_post2") , ]
In response to @Arun's answer: the names in my data actually include more than one underscore, which makes the problem more complicated.
data <- data.frame("name"=c("name1_target_time","name1_target_time_post1","name2_target_time","name2_target_time_post1",
"name2_target_time_post2","name3_target_time","name4_target_time"),
"data"=rnorm(7,0,1),
stringsAsFactors=FALSE)
names <- c("name2_target_time","name3_target_time")
subset <- data[ data$name %in% names | data$name %in% paste0(names,"_post1") | data$name %in% paste0(names,"_post2") , ]
Upvotes: 1
Reputation: 118779
Edit: solution using regular expressions (following OP's follow-up in comment):
data[grepl(paste(names, collapse="|"), data$name), ]
# name data
# 3 name2 1.4934931
# 4 name2_post1 -1.6070809
# 5 name2_post2 -0.4157518
# 6 name3 0.4220084
On your new data:
# name data
# 3 name2_target_time 0.6295361
# 4 name2_target_time_post1 0.8951720
# 5 name2_target_time_post2 0.6602126
# 6 name3_target_time 2.2734835
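One caveat about the unanchored pattern above: it matches a wanted name anywhere inside the string, so a row such as "name20" would also be kept. The following is a minimal sketch of an anchored variant, assuming the names contain no regex metacharacters and that every postfix starts with an underscore. The data frame `df`, the vector `wanted`, and the extra row "name20" are hypothetical and only serve to show the difference.

```r
# Hypothetical example data; "name20" is added to expose partial matches.
df <- data.frame(name = c("name1", "name2", "name2_post1", "name20", "name3_post7"),
                 stringsAsFactors = FALSE)
wanted <- c("name2", "name3")

# Unanchored alternation: TRUE whenever a wanted name occurs anywhere in the string.
loose <- grepl(paste(wanted, collapse = "|"), df$name)

# Anchored: the whole string must be a wanted name,
# optionally followed by an underscore and any postfix.
pattern <- paste0("^(", paste(wanted, collapse = "|"), ")(_.*)?$")
strict <- grepl(pattern, df$name)

df$name[loose]   # keeps "name20" as well
df$name[strict]  # drops "name20"
```

If partial matches cannot occur in your data, the shorter unanchored pattern is enough.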
As @flodel shows in the comments, this works as well:
subset(data, sub("_post\\d+$", "", name) %in% names)
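To see why this works, it can help to look at the intermediate result of `sub()`. The sketch below uses a small hypothetical vector `x` and lookup vector `wanted` in the style of the question's second data set, and assumes every postfix has the form "_post" followed by digits at the end of the string.

```r
# Hypothetical names in the style of the question's second data set.
x <- c("name2_target_time",
       "name2_target_time_post1",
       "name2_target_time_post12",
       "name4_target_time")
wanted <- c("name2_target_time", "name3_target_time")

# Strip a trailing "_post<digits>"; strings without it are returned unchanged.
base <- sub("_post\\d+$", "", x)
base
# "name2_target_time" "name2_target_time" "name2_target_time" "name4_target_time"

# The stripped names can now be compared exactly against the lookup vector.
base %in% wanted
# TRUE TRUE TRUE FALSE
```

Because the comparison after stripping is exact, underscores inside the base name do no harm.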
Old solution:
data[sapply(strsplit(data$name, "_"), "[[", 1) %in% names, ]
# name data
# 3 name2 1.4934931
# 4 name2_post1 -1.6070809
# 5 name2_post2 -0.4157518
# 6 name3 0.4220084
The idea: first split the string at _ using strsplit. This results in a list with one element per string. For example, name2 gives just name2, while name2_post1 gives name2 (the first piece) and post1 (the second piece). By wrapping the call in sapply and using "[[" with 1, we select just the first piece from each element of the list. We can then use %in% to check whether that piece is present in names.
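The intermediate values can be inspected with a short sketch. The input vector `x` is hypothetical; its last entry follows the multi-underscore naming from the question's follow-up and shows why this approach only suits names that contain no underscore of their own.

```r
# Hypothetical inputs: two simple names and one with underscores in the base name.
x <- c("name2", "name2_post1", "name2_target_time_post1")

# strsplit returns a list with one character vector of pieces per input string.
parts <- strsplit(x, "_")
lengths(parts)
# 1 2 4

# Take the first piece of each element.
first <- sapply(parts, "[[", 1)
first
# "name2" "name2" "name2"

# With simple names the lookup succeeds ...
first %in% c("name2", "name3")
# TRUE TRUE TRUE

# ... but a lookup vector of multi-underscore names is never matched.
first %in% c("name2_target_time", "name3_target_time")
# FALSE FALSE FALSE
```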
Upvotes: 3
Reputation: 488
A grep solution would probably look something like the following:
subset <- data[grep("(name2)|(name3)", data$name), ]
Upvotes: 0