Reputation: 564
I'm trying to find a sudden decrease in a value (column v44
) within many small groups (of file_id
and type
) in a dataframe/tibble (dat
).
I want to first get rid of all the values that are too high or too low and then select the first one. I calculate the difference between values, v44_diff
. The first value within each group should then be used for flagging subsequent values, which should not show a greater than fac
* the decrease than the initial value.
EDIT: Ok ok, I've rewritten below to a small reprex.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tibble)
library(ggplot2)
min <- 3000
max <- 50000
fac <- 1.5
dat <- tribble( ~ file_id, ~ type, ~ cycle, ~ v44,
"hey", "std", 0, 50300,
"hey", "std", 1, 40000,
"hey", "std", 2, 35000,
"hey", "std", 3, 32000,
"hey", "std", 4, 31000,
"hey", "std", 5, 30000,
"hey", "std", 6, 29500,
"hey", "smp", 1, 40100,
"hey", "smp", 2, 35100,
"hey", "smp", 3, 32100,
"hey", "smp", 4, 5000,
"hey", "smp", 5, 20,
"hey", "smp", 6, 10,
"hi", "std", 0, 49000,
"hi", "std", 1, 39700,
"hi", "std", 2, 32000,
"hi", "std", 3, 30000,
"hi", "std", 4, 29500,
"hi", "std", 5, 29400,
"hi", "std", 6, 29200,
"hi", "smp", 1, 49100,
"hi", "smp", 2, 39600,
"hi", "smp", 3, 31100,
"hi", "smp", 4, 30000,
"hi", "smp", 5, 29600,
"hi", "smp", 6, 29400)
dat %>%
ggplot(aes(x = cycle, y = v44, colour = type)) +
geom_line(aes(group = paste(file_id, type))) +
facet_grid(rows = vars(type))
dat %>%
group_by(file_id, type) %>%
mutate(v44_low = v44 <= min, # creates a flag
v44_high = v44 >= max,
v44_diff = lead(v44) - v44) %>%
mutate(v44_drop = v44_diff < fac * first(filter(., !v44_low, !v44_high)$v44_diff)) %>%
ungroup(file_id, type)
#> # A tibble: 26 x 8
#> file_id type cycle v44 v44_low v44_high v44_diff v44_drop
#> <chr> <chr> <dbl> <dbl> <lgl> <lgl> <dbl> <lgl>
#> 1 hey std 0 50300 FALSE TRUE -10300 TRUE
#> 2 hey std 1 40000 FALSE FALSE -5000 FALSE
#> 3 hey std 2 35000 FALSE FALSE -3000 FALSE
#> 4 hey std 3 32000 FALSE FALSE -1000 FALSE
#> 5 hey std 4 31000 FALSE FALSE -1000 FALSE
#> 6 hey std 5 30000 FALSE FALSE -500 FALSE
#> 7 hey std 6 29500 FALSE FALSE NA NA
#> 8 hey smp 1 40100 FALSE FALSE -5000 FALSE
#> 9 hey smp 2 35100 FALSE FALSE -3000 FALSE
#> 10 hey smp 3 32100 FALSE FALSE -27100 TRUE
#> # … with 16 more rows
but this turned out to be very very slow, as there are many groups.
See https://github.com/tidyverse/dplyr/issues/3294 for an explanation of why filtering within many groups is slow.
I know how to rewrite this to a faster version, but it will still create a copy:
out <- dat %>%
group_by(file_id, type) %>%
mutate(v44_low = v44 <= min,
v44_high = v44 >= max,
v44_diff = lead(v44) - v44) %>%
filter(!v44_low, !v44_high) %>%
mutate(v44_drop = v44_diff < fac * first(.$v44_diff)) %>%
select(file_id, type, cycle, v44_drop)
out <- dat %>%
left_join(out, by = c("file_id", "type", "cycle")) %>%
ungroup(file_id, type)
out
#> # A tibble: 26 x 5
#> file_id type cycle v44 v44_drop
#> <chr> <chr> <dbl> <dbl> <lgl>
#> 1 hey std 0 50300 NA
#> 2 hey std 1 40000 FALSE
#> 3 hey std 2 35000 FALSE
#> 4 hey std 3 32000 FALSE
#> 5 hey std 4 31000 FALSE
#> 6 hey std 5 30000 FALSE
#> 7 hey std 6 29500 NA
#> 8 hey smp 1 40100 FALSE
#> 9 hey smp 2 35100 FALSE
#> 10 hey smp 3 32100 TRUE
#> # … with 16 more rows
Created on 2020-02-26 by the reprex package (v0.3.0)
This is because I want to keep the rows with high/low values, but I don't want them to be used to calculate the difference.
Is there any way I can rewrite this to be faster? Where I don't have to filter within groups, and I don't have to create a copy to merge back into the final result?
Upvotes: 1
Views: 795
Reputation: 46898
When you do group_by, first
operates on a group level. So for example we use a simple test:
dat %>%
group_by(file_id, type) %>%
mutate(is_first = v44 == first(v44))
# A tibble: 26 x 5
# Groups: file_id, type [4]
file_id type cycle v44 is_first
<chr> <chr> <dbl> <dbl> <lgl>
1 hey std 0 50300 TRUE
2 hey std 1 40000 FALSE
3 hey std 2 35000 FALSE
4 hey std 3 32000 FALSE
5 hey std 4 31000 FALSE
6 hey std 5 30000 FALSE
7 hey std 6 29500 FALSE
8 hey smp 1 40100 TRUE
9 hey smp 2 35100 FALSE
10 hey smp 3 32100 FALSE
But if you call .$v44, you are doing it for the whole data.frame, outside of the group:
dat %>%
group_by(file_id, type) %>%
mutate(is_first = v44 == first(.$v44))
file_id type cycle v44 is_first
<chr> <chr> <dbl> <dbl> <lgl>
1 hey std 0 50300 TRUE
2 hey std 1 40000 FALSE
3 hey std 2 35000 FALSE
4 hey std 3 32000 FALSE
5 hey std 4 31000 FALSE
6 hey std 5 30000 FALSE
7 hey std 6 29500 FALSE
8 hey smp 1 40100 FALSE
9 hey smp 2 35100 FALSE
10 hey smp 3 32100 FALSE
You can see for row 8, the first value doesn't show TRUE. So if you are only defining differences within the group, do not use .$ .
From the problem you have described, you only need to work on the vector. If you use filter, you are working on the whole data frame, so my suggestion is to use [ which works on the vector. In the example below I also replaced the differences for values outside min and max with 0:
test = dat %>%
group_by(file_id, type) %>%
mutate(v44_diff = lead(v44) - v44) %>%
mutate(v44_diff = replace(v44_diff,v44 < min | v44 > max,0)) %>%
mutate(v44_drop = v44_diff < fac*first(v44_diff[v44_diff!=0])) %>%
ungroup(file_id, type)
Upvotes: 1
Reputation: 191
Try the following generic steps:-
my_new_data <- as.data.frame(my_data, c(my_data != head(my_data)| my_data != tail(my_data)))
Try this...
Upvotes: 0