Reputation: 23
I have a df (“df”) containing multiple time series (value ~ time) whose observations are grouped by 3 factors: temp, rep, and species. These data need to be trimmed at the lower and upper ends of the time series, but these threshold values are group conditional (e.g. remove observations below 2 and above 10 where temp=10, rep=2, and species = “A”). I have an accompanying df (df_thresholds) that contains grouping values and the mins and maxs i want to use for each group. Not all groups need trimming (I would like to update this file regularly which would guide where to trim df). Can anybody help me conditionally filter out these values by group? I have the following, which is close but not quite there. When I reverse the max and min boolean tests, I get zero observations.
df <- data.frame(species = c(rep("A", 16), rep("B", 16)),
temp=as.factor(c(rep(10,4),rep(20,4),rep(10,4),rep(20,4))),
rep=as.factor(c(rep(1,8),rep(2,8),rep(1,8),rep(2,8))),
time=rep(seq(1:4),4),
value=c(1,4,8,16,2,4,9,16,2,4,10,16,2,4,15,16,2,4,6,16,1,4,8,16,1,2,8,16,2,3,4,16))
df_thresholds <- data.frame(species=c("A", "A", "B"),
temp=as.factor(c(10,20,10)),
rep=as.factor(c(1,1,2)),
min_value=c(2,4,2),
max_value=c(10,10,9))
#desired outcome
df_desired <- df[c(2:3,6:7,9:24,26:27,29:nrow(df)),]
#attempt
df2 <- df
for (i in 1:nrow(df_thresholds)) {
df2 <- df2 %>%
filter(!(species==df_thresholds$species[i] & temp==df_thresholds$temp[i] & rep==df_thresholds$rep[i] & value>df_thresholds$min_value[i] & value<df_thresholds$max_value[i]))
}
EDIT: Here's the solution I implemented per suggestions below.
df_test <- left_join(df, df_thresholds, by=c('species','temp','rep'))
df_test$min_value[is.na(df_test$min_value)] <- 0
df_test$max_value[is.na(df_test$max_value)] <- 999
df_test2 <- df_test %>%
filter(value >= min_value & value <= max_value)
Upvotes: 0
Views: 104
Reputation: 388982
We can find out indices which we want to exclude using mapply
df[-c(with(df_thresholds,
mapply(function(x, y, z, min_x, max_x)
which(df$species == x & df$temp == y & df$rep == z &
(df$value < min_x | df$value > max_x)),
species, temp, rep, min_value, max_value))), ]
# species temp rep time value
#2 A 10 1 2 4
#3 A 10 1 3 8
#6 A 20 1 2 4
#7 A 20 1 3 9
#9 A 10 2 1 2
#10 A 10 2 2 4
#11 A 10 2 3 10
#12 A 10 2 4 16
#......
In mapply
we pass all the columns of df_thresholds
filter df
accordingly and find out indices which are outside min and max value for each row and exclude them from the original dataframe.
The result of mapply
call is
#[1] 1 4 5 8 25 28
which are the rows we want to exclude from the df
since they fall out of range.
Upvotes: 0