hsl
hsl

Reputation: 681

How to select specific columns containing certain strings/characters?

I have this dataframe:

df1 <- data.frame(a = c("correct", "wrong", "wrong", "correct"),
  b = c(1, 2, 3, 4),
  c = c("wrong", "wrong", "wrong", "wrong"),
  d = c(2, 2, 3, 4))

a       b c     d
correct 1 wrong 2
wrong   2 wrong 2
wrong   3 wrong 3
correct 4 wrong 4

and would like to select only the columns with either the strings 'correct' or 'wrong' (i.e., columns b and d in df1), such that I get this dataframe:

df2 <- data.frame(a = c("correct", "wrong", "wrong", "correct"),
        c = c("wrong", "wrong", "wrong", "wrong"))

        a     c
1 correct wrong
2   wrong wrong
3   wrong wrong
4 correct wrong

Can I use dplyr to do this? If not, what function(s) can I use to do this? The example I've given is straightforward, in that I can just do this (dplyr):

select(df1, a, c)

However, in my actual dataframe, I have about 700 variables/columns and a few hundred columns that contain the strings 'correct' or 'wrong' and I don't know the variable/column names.

Any suggestions as to how to do this quickly? Thanks a lot!

Upvotes: 7

Views: 4753

Answers (2)

Pierre L
Pierre L

Reputation: 28461

---- update ----- Thanks Colonel Beavel. What an elegant solution. I will def use Filter more.

I want to check a speed solution too just in case time is an important factor:

locator <- apply(df1, 2, function(x) grepl("correct|wrong", x))
index <- apply(locator, 2, any)
newdf <- df1[,!index]

I expanded your data frame to 500,000 columns:

dftest <- as.data.frame(replicate(500000, df1[,1]))

Then tested the system time for a function with apply, Filter with grepl, and Filter with pattern %in%:

f <- function() {
locator <- apply(dftest, 2, function(x) grepl("correct|wrong", x))
index <- apply(locator, 2, any)
newdf <- dftest[,!index]
}

f1 <- function() {newdf <- (Filter(function(x) any(c("wrong", "correct") %in% x), dftest))}

f2 <- function() {newdf <- Filter(function(u) any(grepl('wrong|correct',u)), dftest)}


system.time(f())
   user  system elapsed 
   24.32    0.00   24.35 
system.time(f1())
   user  system elapsed 
   2.31    0.00    2.34 
system.time(f2())
   user  system elapsed 
   8.66    0.01    8.71 

Colonel's solution is by far the best one. It's clean and performs best. --credit @akrun for data.frame suggestion.

Upvotes: 2

Colonel Beauvel
Colonel Beauvel

Reputation: 31181

You can use base R Filter which will operate on each of df1's columns and keep all ones satisfying the logical test in the function:

Filter(function(u) any(c('wrong','correct') %in% u), df1)
#        a     c
#1 correct wrong
#2   wrong wrong
#3   wrong wrong
#4 correct wrong

You can also use grepl:

Filter(function(u) any(grepl('wrong|correct',u)), df1)

Upvotes: 9

Related Questions