Reputation: 681
I have this dataframe:
df1 <- data.frame(a = c("correct", "wrong", "wrong", "correct"),
b = c(1, 2, 3, 4),
c = c("wrong", "wrong", "wrong", "wrong"),
d = c(2, 2, 3, 4))
a b c d
correct 1 wrong 2
wrong 2 wrong 2
wrong 3 wrong 3
correct 4 wrong 4
and would like to select only the columns with either the strings 'correct' or 'wrong' (i.e., columns b and d in df1), such that I get this dataframe:
df2 <- data.frame(a = c("correct", "wrong", "wrong", "correct"),
c = c("wrong", "wrong", "wrong", "wrong"))
a c
1 correct wrong
2 wrong wrong
3 wrong wrong
4 correct wrong
Can I use dplyr to do this? If not, what function(s) can I use to do this? The example I've given is straightforward, in that I can just do this (dplyr):
select(df1, a, c)
However, in my actual dataframe, I have about 700 variables/columns and a few hundred columns that contain the strings 'correct' or 'wrong' and I don't know the variable/column names.
Any suggestions as to how to do this quickly? Thanks a lot!
Upvotes: 7
Views: 4753
Reputation: 28461
---- update -----
Thanks Colonel Beavel. What an elegant solution. I will def use Filter
more.
I want to check a speed solution too just in case time is an important factor:
locator <- apply(df1, 2, function(x) grepl("correct|wrong", x))
index <- apply(locator, 2, any)
newdf <- df1[,!index]
I expanded your data frame to 500,000 columns:
dftest <- as.data.frame(replicate(500000, df1[,1]))
Then tested the system time for a function with apply
, Filter
with grepl, and Filter
with pattern %in%:
f <- function() {
locator <- apply(dftest, 2, function(x) grepl("correct|wrong", x))
index <- apply(locator, 2, any)
newdf <- dftest[,!index]
}
f1 <- function() {newdf <- (Filter(function(x) any(c("wrong", "correct") %in% x), dftest))}
f2 <- function() {newdf <- Filter(function(u) any(grepl('wrong|correct',u)), dftest)}
system.time(f())
user system elapsed
24.32 0.00 24.35
system.time(f1())
user system elapsed
2.31 0.00 2.34
system.time(f2())
user system elapsed
8.66 0.01 8.71
Colonel's solution is by far the best one. It's clean and performs best. --credit @akrun for data.frame suggestion.
Upvotes: 2
Reputation: 31181
You can use base R
Filter
which will operate on each of df1
's columns and keep all ones satisfying the logical test in the function:
Filter(function(u) any(c('wrong','correct') %in% u), df1)
# a c
#1 correct wrong
#2 wrong wrong
#3 wrong wrong
#4 correct wrong
You can also use grepl
:
Filter(function(u) any(grepl('wrong|correct',u)), df1)
Upvotes: 9