Japhir

Reputation: 564

How to select the first value within a group - without filtering?

I'm trying to find a sudden decrease in a value (column v44) within many small groups (of file_id and type) in a dataframe/tibble (dat).

I want to first get rid of all the values that are too high or too low and then select the first remaining one. I calculate the difference between consecutive values, v44_diff. The first difference within each group should then be used for flagging subsequent values: a value is flagged if it drops by more than fac times the initial decrease.

EDIT: Ok ok, I've rewritten it below as a small reprex.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tibble)
library(ggplot2)

min <- 3000
max <- 50000
fac <- 1.5

dat <- tribble( ~ file_id, ~ type, ~ cycle, ~ v44,
               "hey", "std", 0, 50300,
               "hey", "std", 1, 40000,
               "hey", "std", 2, 35000,
               "hey", "std", 3, 32000,
               "hey", "std", 4, 31000,
               "hey", "std", 5, 30000,
               "hey", "std", 6, 29500,
               "hey", "smp", 1, 40100,
               "hey", "smp", 2, 35100,
               "hey", "smp", 3, 32100,
               "hey", "smp", 4, 5000,
               "hey", "smp", 5, 20,
               "hey", "smp", 6, 10,
               "hi", "std", 0, 49000,
               "hi", "std", 1, 39700,
               "hi", "std", 2, 32000,
               "hi", "std", 3, 30000,
               "hi", "std", 4, 29500,
               "hi", "std", 5, 29400,
               "hi", "std", 6, 29200,
               "hi", "smp", 1, 49100,
               "hi", "smp", 2, 39600,
               "hi", "smp", 3, 31100,
               "hi", "smp", 4, 30000,
               "hi", "smp", 5, 29600,
               "hi", "smp", 6, 29400)

dat %>%
  ggplot(aes(x = cycle, y = v44, colour = type)) +
  geom_line(aes(group = paste(file_id, type))) +
  facet_grid(rows = vars(type))


dat %>%
    group_by(file_id, type) %>%
    mutate(v44_low = v44 <= min,  # creates a flag
           v44_high = v44 >= max,
           v44_diff = lead(v44) - v44) %>%
    mutate(v44_drop = v44_diff < fac * first(filter(., !v44_low, !v44_high)$v44_diff)) %>%
    ungroup(file_id, type)
#> # A tibble: 26 x 8
#>    file_id type  cycle   v44 v44_low v44_high v44_diff v44_drop
#>    <chr>   <chr> <dbl> <dbl> <lgl>   <lgl>       <dbl> <lgl>   
#>  1 hey     std       0 50300 FALSE   TRUE       -10300 TRUE    
#>  2 hey     std       1 40000 FALSE   FALSE       -5000 FALSE   
#>  3 hey     std       2 35000 FALSE   FALSE       -3000 FALSE   
#>  4 hey     std       3 32000 FALSE   FALSE       -1000 FALSE   
#>  5 hey     std       4 31000 FALSE   FALSE       -1000 FALSE   
#>  6 hey     std       5 30000 FALSE   FALSE        -500 FALSE   
#>  7 hey     std       6 29500 FALSE   FALSE          NA NA      
#>  8 hey     smp       1 40100 FALSE   FALSE       -5000 FALSE   
#>  9 hey     smp       2 35100 FALSE   FALSE       -3000 FALSE   
#> 10 hey     smp       3 32100 FALSE   FALSE      -27100 TRUE    
#> # … with 16 more rows

But this turned out to be very slow, as there are many groups.

See https://github.com/tidyverse/dplyr/issues/3294 for an explanation of why filtering within many groups is slow.
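To make the slowness concrete, here is a rough timing sketch on a larger made-up dataset (the group count and the random v44 values are arbitrary, just for illustration), reusing min, max and fac from the reprex above:

# Hypothetical larger dataset: 100 file_ids x 2 types = 200 small groups of 7 cycles
big <- as_tibble(expand.grid(file_id = sprintf("f%03d", 1:100),
                             type = c("std", "smp"),
                             cycle = 0:6,
                             stringsAsFactors = FALSE)) %>%
  mutate(v44 = runif(n(), 0, 60000))

# Time the filter-inside-mutate approach: the inner filter() runs once per group,
# and each run itself loops over all groups, which is what makes it slow.
system.time(
  big %>%
    group_by(file_id, type) %>%
    mutate(v44_low = v44 <= min,
           v44_high = v44 >= max,
           v44_diff = lead(v44) - v44) %>%
    mutate(v44_drop = v44_diff < fac * first(filter(., !v44_low, !v44_high)$v44_diff)) %>%
    ungroup()
)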

I know how to rewrite this into a faster version, but it still creates a copy:

out <- dat %>%
  group_by(file_id, type) %>%
  mutate(v44_low = v44 <= min,
         v44_high = v44 >= max,
         v44_diff = lead(v44) - v44) %>%
  filter(!v44_low, !v44_high) %>%
  mutate(v44_drop = v44_diff < fac * first(.$v44_diff)) %>%
  select(file_id, type, cycle, v44_drop)

out <- dat %>%
    left_join(out, by = c("file_id", "type", "cycle")) %>%
    ungroup(file_id, type)
out
#> # A tibble: 26 x 5
#>    file_id type  cycle   v44 v44_drop
#>    <chr>   <chr> <dbl> <dbl> <lgl>   
#>  1 hey     std       0 50300 NA      
#>  2 hey     std       1 40000 FALSE   
#>  3 hey     std       2 35000 FALSE   
#>  4 hey     std       3 32000 FALSE   
#>  5 hey     std       4 31000 FALSE   
#>  6 hey     std       5 30000 FALSE   
#>  7 hey     std       6 29500 NA      
#>  8 hey     smp       1 40100 FALSE   
#>  9 hey     smp       2 35100 FALSE   
#> 10 hey     smp       3 32100 TRUE    
#> # … with 16 more rows

Created on 2020-02-26 by the reprex package (v0.3.0)

I need the join because I want to keep the rows with high/low values in the output, but I don't want them to be used when calculating the difference.

Is there any way I can rewrite this to be faster, so that I don't have to filter within groups and don't have to create a copy to merge back into the final result?

Upvotes: 1

Views: 795

Answers (2)

StupidWolf

Reputation: 46898

When you use group_by(), first() operates at the group level. For example, here is a simple test:

dat %>%
  group_by(file_id, type) %>%
  mutate(is_first = v44 == first(v44))

# A tibble: 26 x 5
# Groups:   file_id, type [4]
   file_id type  cycle   v44 is_first
   <chr>   <chr> <dbl> <dbl> <lgl>   
 1 hey     std       0 50300 TRUE    
 2 hey     std       1 40000 FALSE   
 3 hey     std       2 35000 FALSE   
 4 hey     std       3 32000 FALSE   
 5 hey     std       4 31000 FALSE   
 6 hey     std       5 30000 FALSE   
 7 hey     std       6 29500 FALSE   
 8 hey     smp       1 40100 TRUE    
 9 hey     smp       2 35100 FALSE   
10 hey     smp       3 32100 FALSE 

But if you call .$v44, you are operating on the whole data frame, outside of the groups:

dat %>%
  group_by(file_id, type) %>%
  mutate(is_first = v44 == first(.$v44))

   file_id type  cycle   v44 is_first
   <chr>   <chr> <dbl> <dbl> <lgl>   
 1 hey     std       0 50300 TRUE    
 2 hey     std       1 40000 FALSE   
 3 hey     std       2 35000 FALSE   
 4 hey     std       3 32000 FALSE   
 5 hey     std       4 31000 FALSE   
 6 hey     std       5 30000 FALSE   
 7 hey     std       6 29500 FALSE   
 8 hey     smp       1 40100 FALSE   
 9 hey     smp       2 35100 FALSE   
10 hey     smp       3 32100 FALSE

You can see that row 8, the first row of the hey/smp group, no longer shows TRUE. So if you only want to define differences within each group, do not use .$ .

From the problem you have described, you only need to work on the vector. If you use filter(), you are working on the whole data frame, so my suggestion is to use [ , which works on the vector. In the example below I also replace the differences for values outside min and max with 0:

test <- dat %>%
  group_by(file_id, type) %>%
  # difference to the next cycle within each group
  mutate(v44_diff = lead(v44) - v44) %>%
  # zero out the differences for values that are too low or too high
  mutate(v44_diff = replace(v44_diff, v44 <= min | v44 >= max, 0)) %>%
  # flag drops larger than fac times the first in-range difference
  mutate(v44_drop = v44_diff < fac * first(v44_diff[v44_diff != 0])) %>%
  ungroup(file_id, type)
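If you prefer the out-of-range cycles to come out as NA (as in the join-based output in the question) rather than getting a flag computed from a zeroed difference, a small variation of the same idea is to replace with NA and skip the NAs when taking the first difference. A sketch (test_na is just an illustrative name):

test_na <- dat %>%
  group_by(file_id, type) %>%
  mutate(v44_diff = lead(v44) - v44,
         # NA instead of 0 for values outside the accepted range
         v44_diff = replace(v44_diff, v44 <= min | v44 >= max, NA_real_),
         # the first non-missing difference in the group sets the threshold
         v44_drop = v44_diff < fac * first(v44_diff[!is.na(v44_diff)])) %>%
  ungroup()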

Upvotes: 1

Sayantan4796

Reputation: 191

Try the following generic approach.

Suppose the name of the variable in which you are storing the data is my_data.

Create a new variable that excludes the head (first row) and tail (last row) of the original dataset, for example:

my_new_data <- my_data[-c(1, nrow(my_data)), ]

Try this...

Upvotes: 0
