Reputation: 555
Say I have a massive dataframe, and in multiple columns I have an extremely large list of unique codes that I want to use to select certain rows to subset the original dataframe. There are around 1000 codes, and the codes I want all follow one another. For example, I have about 30 columns that contain codes, and I only want to take rows that have codes 100 to 120 in ANY of these columns.
There's a long way to do this, which is something like
new_dat <- df[which(df$codes == 100 | df$codes == 101 | df$codes1 == 100 | ...), ]
and I repeat this for every single possible code for every one of the columns that can contain these codes. Is there a way to do this in a more convenient fashion?
I want to try solving this with dplyr's select function, but I'm having trouble seeing if it works for my case out of the box.
Take the iris dataset. Say I wanted all rows that contain a value between 4.0 and 5.0 in any column that contains the word Sepal in the column name.
# this only checks for exactly 4.0
brand_new_df <- select(filter(iris, Sepal.Length == 4.0 | Sepal.Width == 4.0))
but what I want is something like
brand_new_df <- select(filter(iris, contains(Sepal) == 4.0:5.0))
Is there a dplyr way to do this?
Upvotes: 2
Views: 138
Reputation: 35584
A corresponding across() version from @RonakShah's answer:
library(dplyr)
iris %>% filter(rowSums(across(contains('Sepal'), ~ between(., 4, 5))) > 0)
or
iris %>% filter(rowSums(across(contains('Sepal'), between, 4, 5)) > 0)
From vignette("colwise")
:
Previously,
filter()
was paired with theall_vars()
andany_vars()
helpers. Now,across()
is equivalent toall_vars()
, and there’s no direct replacement forany_vars()
.
So you need something like rowSums(...) > 0
to achieve the effect of any_vars()
.
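If your dplyr version is 1.0.4 or later, if_any() restores the any_vars() behaviour directly; a sketch with the same columns and range as above:
iris %>% filter(if_any(contains('Sepal'), ~ between(.x, 4, 5)))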
Upvotes: 3
Reputation: 39858
One option could be:
iris %>%
filter(Reduce(`|`, across(contains("Sepal"), ~ between(.x, 4, 5))))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.9 3.0 1.4 0.2 1
2 4.7 3.2 1.3 0.2 1
3 4.6 3.1 1.5 0.2 1
4 5.0 3.6 1.4 0.2 1
5 4.6 3.4 1.4 0.3 1
6 5.0 3.4 1.5 0.2 1
7 4.4 2.9 1.4 0.2 1
8 4.9 3.1 1.5 0.1 1
9 4.8 3.4 1.6 0.2 1
10 4.8 3.0 1.4 0.1 1
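Applied to the codes problem from the question (assuming the relevant columns share a name component such as "codes", as in the example data of the base R answer below), the same pattern would be something like:
df %>%
  filter(Reduce(`|`, across(contains("codes"), ~ between(.x, 100, 120))))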
Upvotes: 1
Reputation: 41230
library(dplyr)
df <- iris
# value to look for
val <- 4
# find columns
cols <- which(colSums(df == val, na.rm = TRUE) > 0L)
# filter rows
iris %>% filter_at(cols, any_vars(. == val))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.8 4.0 1.2 0.2 setosa
2 5.5 2.3 4.0 1.3 versicolor
3 6.0 2.2 4.0 1.0 versicolor
4 6.1 2.8 4.0 1.3 versicolor
5 5.5 2.5 4.0 1.3 versicolor
6 5.8 2.6 4.0 1.2 versicolor
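The same idea extends to the question's 4-5 range rather than a single value; a sketch that keeps the column-detection step and restricts it to numeric columns:
# range to look for instead of a single value
lo <- 4
hi <- 5
# numeric columns that contain at least one value in the range
num_cols <- names(df)[sapply(df, is.numeric)]
cols <- num_cols[colSums(df[num_cols] >= lo & df[num_cols] <= hi, na.rm = TRUE) > 0L]
# filter rows where any of those columns falls in the range
df %>% filter_at(cols, any_vars(between(., lo, hi)))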
Upvotes: 0
Reputation: 5788
Base R:
# Subset: rows where any codes column falls in the 100-120 range
cols <- grep("codes", names(df2), value = TRUE)
df2[rowSums(sapply(cols,
                   function(x) {
                     df2[, x] >= 100 & df2[, x] <= 120
                   })) > 0, ]

# Data:
tmp <- data.frame(x1 = rnorm(999, mean = 100, sd = 2))
df <- setNames(data.frame(tmp[rep(1, each = 80)]), paste0("codes", 1:80))
df2 <- cbind(id = 1:nrow(df), df)
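The same base R pattern applied to the iris example from the question, using the Sepal columns and the 4-5 range:
sepal_cols <- grep("Sepal", names(iris), value = TRUE)
iris[rowSums(sapply(sepal_cols,
                    function(x) {
                      iris[, x] >= 4 & iris[, x] <= 5
                    })) > 0, ]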
Upvotes: 1
Reputation: 389055
You can use filter_at:
library(dplyr)
iris %>% filter_at(vars(contains('Sepal')), any_vars(between(., 4, 5)))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 4.9 3.0 1.4 0.2 setosa
#2 4.7 3.2 1.3 0.2 setosa
#3 4.6 3.1 1.5 0.2 setosa
#4 5.0 3.6 1.4 0.2 setosa
#5 4.6 3.4 1.4 0.3 setosa
#6 5.0 3.4 1.5 0.2 setosa
#7 4.4 2.9 1.4 0.2 setosa
#....
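If the code columns can be matched by name (for example a shared "codes" prefix, which is an assumption the question doesn't confirm), this maps onto the original problem as:
new_dat <- df %>% filter_at(vars(contains('codes')), any_vars(between(., 100, 120)))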
Upvotes: 2