Reputation: 331

DPLYR filter multiple groups each with their own criteria

I want each group's numbers to be above a group-specific threshold. For example, I want group 1 to have values above .25, group 2 above .5, etc. Reprex below.

set.seed(1234)
group <- c(rep("group 1", 30),
           rep("group 2", 30), 
           rep("group 3", 30),
           rep("group 4", 30))

number <- c(runif(30, 0, .5),    #group 1 data
            runif(30, .25, .75), #group 2 data, etc.
            runif(30, .5, 1),
            runif(30, .75, 1.25))

d <- data.frame(group = group,
                number = number)

threshold <- c(.25, .5, .75, 1)

library(dplyr)

d %>% group_by(group) %>% filter(number >= threshold)

The final line returns the warning:

Warning messages:
1: In number >= threshold :
  longer object length is not a multiple of shorter object length
2: In number >= threshold :
  longer object length is not a multiple of shorter object length
3: In number >= threshold :
  longer object length is not a multiple of shorter object length
4: In number >= threshold :
  longer object length is not a multiple of shorter object length

Please advise. Thanks!

Upvotes: 2

Answers (3)

neilfws

Reputation: 33802

One way to do this with groups: add a column that derives the threshold from the group index. This works for your example data because the thresholds happen to equal the group index divided by 4, so it may not be a general solution.

d %>% 
  group_by(group) %>% 
  mutate(threshold = cur_group_id() / 4) %>% 
  filter(number >= threshold)

     group    number    threshold
1  group 1 0.3111497    0.25
2  group 1 0.3046374    0.25
3  group 1 0.3116897    0.25
4  group 1 0.4304577    0.25
5  group 1 0.3201553    0.25
6  group 1 0.3330419    0.25
7  group 1 0.2571256    0.25
8  group 1 0.3467956    0.25
9  group 1 0.2724874    0.25
10 group 1 0.4617167    0.25
11 group 1 0.4186478    0.25
12 group 1 0.4052993    0.25
13 group 1 0.2628488    0.25
14 group 1 0.4573291    0.25
15 group 1 0.4156725    0.25
16 group 2 0.5036534    0.50
17 group 2 0.6298353    0.50
18 group 2 0.7460752    0.50
19 group 2 0.6536762    0.50
20 group 2 0.5266668    0.50
21 group 2 0.5732030    0.50
22 group 2 0.5609096    0.50
23 group 2 0.5009987    0.50
24 group 2 0.5885473    0.50
25 group 2 0.6327299    0.50
26 group 2 0.6086359    0.50
27 group 2 0.5022730    0.50
28 group 2 0.5019667    0.50
29 group 2 0.6256001    0.50
30 group 2 0.6741962    0.50
31 group 3 0.9324169    0.75
32 group 3 0.8532473    0.75
33 group 3 0.7542738    0.75
34 group 3 0.7822849    0.75
35 group 3 0.9464182    0.75
36 group 3 0.8915606    0.75
37 group 3 0.7595950    0.75
38 group 3 0.8342477    0.75
39 group 3 0.9632002    0.75
40 group 3 0.7721349    0.75
41 group 3 0.9492902    0.75
42 group 3 0.9480929    0.75
43 group 4 1.2002123    1.00
44 group 4 1.0057918    1.00
45 group 4 1.1210598    1.00
46 group 4 1.0325381    1.00
47 group 4 1.1066508    1.00
48 group 4 1.2251525    1.00
49 group 4 1.2065439    1.00
50 group 4 1.2229266    1.00
51 group 4 1.1485802    1.00
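
A slightly more general variant of the same idea, as a sketch rather than a definitive fix: index the original threshold vector by the group id instead of dividing by 4. This assumes the thresholds are listed in the same order as the sorted group keys (which is how cur_group_id() numbers the groups here) and that dplyr is version 1.0.0 or later. The name kept_by_id is only illustrative.

```r
library(dplyr)

# Same data as in the question
set.seed(1234)
d <- data.frame(
  group  = rep(paste("group", 1:4), each = 30),
  number = c(runif(30, 0, .5), runif(30, .25, .75),
             runif(30, .5, 1), runif(30, .75, 1.25))
)
threshold <- c(.25, .5, .75, 1)

# Assumption: threshold[i] belongs to the i-th group in sorted order.
# cur_group_id() returns that index, so each group is compared to a
# single threshold value.
kept_by_id <- d %>%
  group_by(group) %>%
  filter(number >= threshold[cur_group_id()]) %>%
  ungroup()
```

If the group labels and the threshold vector can get out of order, a lookup keyed by the group name is safer than relying on position.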

Upvotes: 0

smingerson

Reputation: 1438

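It returns this warning because, within each group, filter() compares the group's 30 values to the whole length-4 threshold vector, rather than comparing the first group to the first threshold, and so on. R recycles the shorter vector to match, and 30 is not a multiple of 4.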
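
To make the recycling concrete, here is a small base R sketch that uses only group 1's values. The variable names recycled and res are illustrative, not part of the question.

```r
# One group's 30 values against the length-4 threshold vector
set.seed(1234)
number    <- runif(30, 0, .5)
threshold <- c(.25, .5, .75, 1)

# R stretches `threshold` to length 30: .25, .5, .75, 1, .25, .5, ...
# so each value is compared to whichever threshold lands on its position.
recycled <- rep(threshold, length.out = length(number))

# The direct comparison warns because 30 is not a multiple of 4,
# but it computes exactly the recycled comparison.
res <- suppressWarnings(number >= threshold)
identical(res, number >= recycled)
```

So the filter still runs, but most rows are tested against the wrong threshold.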

set.seed(1234)
group <- c(rep("group 1", 30),
           rep("group 2", 30), 
           rep("group 3", 30),
           rep("group 4", 30))

number <- c(runif(30, 0, .5),    #group 1 data
            runif(30, .25, .75), #group 2 data, etc.
            runif(30, .5, 1),
            runif(30, .75, 1.25))

d <- data.frame(group = group,
                number = number)

threshold <- data.frame(group = c("group 1", "group 2", "group 3", "group 4"),
                        threshold = c(.25, .5, .75, 1))

library(dplyr)

d %>% left_join(threshold, by = 'group') %>% 
  filter(number >= threshold)

By creating a lookup table and joining to it, we add a new column, threshold, to the joined data, which holds the right value for each row's group. Then, when we apply the filter, each value is compared to the correct threshold. Done this way, we don't even need the group_by!
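
If you would rather not add a column, the same lookup can be sketched with a named vector indexed by the group column. This is an alternative to the join above, not part of it, and threshold_by_group and kept are hypothetical names.

```r
library(dplyr)

# Same data as in the question
set.seed(1234)
d <- data.frame(
  group  = rep(paste("group", 1:4), each = 30),
  number = c(runif(30, 0, .5), runif(30, .25, .75),
             runif(30, .5, 1), runif(30, .75, 1.25))
)

# Named vector acting as the lookup table (names must match the group labels)
threshold_by_group <- c("group 1" = .25, "group 2" = .5,
                        "group 3" = .75, "group 4" = 1)

# Indexing by the group label yields one threshold per row;
# as.character() guards against `group` being a factor,
# unname() drops the names before the comparison.
kept <- d %>%
  filter(number >= unname(threshold_by_group[as.character(group)]))
```

The join scales better when the thresholds already live in a data frame or when there are several criteria per group. The named vector is handy for a quick one-off filter.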

Upvotes: 3

Ronak Shah

Reputation: 389275

One way would be to create a data frame that pairs each group value with its threshold:

library(dplyr)
compare_df <- data.frame(group = paste('group', 1:4), threshold)

Now you can join this data frame with d and filter:

d %>%
  left_join(compare_df, by = 'group') %>%
  filter(number >= threshold)

The same in base R:

subset(merge(d, compare_df, by = 'group'), number >= threshold)

Upvotes: 3
