Reputation: 555
Say I have a massive dataframe, and in multiple columns I have an extremely large list of unique codes that I want to use to select certain rows to subset the original dataframe. There are around 1000 codes, and the codes I want all follow one another. For example, I have about 30 columns that contain codes, and I only want to take rows that have codes 100 to 120 in ANY of these columns.
There's a long way to do this, which is something like
new_dat <- df[which(df$codes == 100 | df$codes == 101 | df$codes1 == 100 | ...), ]
and I repeat this for every single possible code for every one of the columns that can contain these codes. Is there a way to do this in a more convenient fashion?
I want to try solving this with dplyr's select function, but I'm having trouble seeing if it works for my case out of the box.
Take the iris dataset. Say I wanted all rows that contain a value between 4.0 and 5.0 in any column that contains the word Sepal in the column name.
# this only checks for exactly 4.0
brand_new_df <- select(filter(iris, Sepal.Length == 4.0 | Sepal.Width == 4.0))
but what I want is something like
brand_new_df <- select(filter(iris, contains(Sepal) == 4.0:5.0))
Is there a dplyr way to do this?
Upvotes: 2
Views: 138
Reputation: 35584
A corresponding across() version from @RonakShah's answer:
library(dplyr)
iris %>% filter(rowSums(across(contains('Sepal'), ~ between(., 4, 5))) > 0)
or
iris %>% filter(rowSums(across(contains('Sepal'), between, 4, 5)) > 0)
From vignette("colwise")
:
Previously,
filter()
was paired with theall_vars()
andany_vars()
helpers. Now,across()
is equivalent toall_vars()
, and there’s no direct replacement forany_vars()
.
So you need something like rowSums(...) > 0
to achieve the effect of any_vars()
.
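If your dplyr version is 1.0.4 or later, if_any() restores the any_vars() behaviour directly; a sketch with the same columns and range as above:
iris %>% filter(if_any(contains('Sepal'), ~ between(.x, 4, 5)))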
Upvotes: 3
Reputation: 39858
One option could be:
iris %>%
filter(Reduce(`|`, across(contains("Sepal"), ~ between(.x, 4, 5))))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.9 3.0 1.4 0.2 1
2 4.7 3.2 1.3 0.2 1
3 4.6 3.1 1.5 0.2 1
4 5.0 3.6 1.4 0.2 1
5 4.6 3.4 1.4 0.3 1
6 5.0 3.4 1.5 0.2 1
7 4.4 2.9 1.4 0.2 1
8 4.9 3.1 1.5 0.1 1
9 4.8 3.4 1.6 0.2 1
10 4.8 3.0 1.4 0.1 1
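Applied to the codes problem from the question (assuming the relevant columns share a name component such as "codes", as in the example data of the base R answer below), the same pattern would be something like:
df %>%
  filter(Reduce(`|`, across(contains("codes"), ~ between(.x, 100, 120))))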
Upvotes: 1
Reputation: 41230
library(dplyr)
df <- iris
# value to look for
val <- 4
# find columns
cols <- which(colSums(df == val, na.rm = TRUE) > 0L)
# filter rows
iris %>% filter_at(cols, any_vars(. == val))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.8 4.0 1.2 0.2 setosa
2 5.5 2.3 4.0 1.3 versicolor
3 6.0 2.2 4.0 1.0 versicolor
4 6.1 2.8 4.0 1.3 versicolor
5 5.5 2.5 4.0 1.3 versicolor
6 5.8 2.6 4.0 1.2 versicolor
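The same idea extends to the question's 4-5 range rather than a single value; a sketch that keeps the column-detection step and restricts it to numeric columns:
# range to look for instead of a single value
lo <- 4
hi <- 5
# numeric columns that contain at least one value in the range
num_cols <- names(df)[sapply(df, is.numeric)]
cols <- num_cols[colSums(df[num_cols] >= lo & df[num_cols] <= hi, na.rm = TRUE) > 0L]
# filter rows where any of those columns falls in the range
df %>% filter_at(cols, any_vars(between(., lo, hi)))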
Upvotes: 0
Reputation: 5788
Base R:
# Subset: rows where any codes column falls in the 100-120 range
cols <- grep("codes", names(df2), value = TRUE)
df2[rowSums(sapply(cols,
                   function(x) {
                     df2[, x] >= 100 & df2[, x] <= 120
                   })) > 0, ]

# Data:
tmp <- data.frame(x1 = rnorm(999, mean = 100, sd = 2))
df <- setNames(data.frame(tmp[rep(1, each = 80)]), paste0("codes", 1:80))
df2 <- cbind(id = 1:nrow(df), df)
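The same base R pattern applied to the iris example from the question, using the Sepal columns and the 4-5 range:
sepal_cols <- grep("Sepal", names(iris), value = TRUE)
iris[rowSums(sapply(sepal_cols,
                    function(x) {
                      iris[, x] >= 4 & iris[, x] <= 5
                    })) > 0, ]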
Upvotes: 1
Reputation: 389055
You can use filter_at:
library(dplyr)
iris %>% filter_at(vars(contains('Sepal')), any_vars(between(., 4, 5)))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 4.9 3.0 1.4 0.2 setosa
#2 4.7 3.2 1.3 0.2 setosa
#3 4.6 3.1 1.5 0.2 setosa
#4 5.0 3.6 1.4 0.2 setosa
#5 4.6 3.4 1.4 0.3 setosa
#6 5.0 3.4 1.5 0.2 setosa
#7 4.4 2.9 1.4 0.2 setosa
#....
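If the code columns can be matched by name (for example a shared "codes" prefix, which is an assumption the question doesn't confirm), this maps onto the original problem as:
new_dat <- df %>% filter_at(vars(contains('codes')), any_vars(between(., 100, 120)))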
Upvotes: 2