Reputation: 447
I have a data frame with dates and magnitudes. For every case where the dates are within 0.6 years from each other, I want to keep the date with the highest absolute magnitude and discard the other.
c(2014.2, 2014.4, 2014.5)
which should give `c(2014.4) if that year had the highest absolute magnitude.c(2016.3, 2016.7, 2017.2)
, where 2016.3 and 2017.2 are not within 0.6 years from each other), I want to treat the dates that are closest to one another as a pair and consider the extra date in the criterion as a next candidate for another pair, (so the output will read like this c(2016.3,
2017.2)
if 2016.3 had the highest absolute magnitude).data:
set.seed(1)
library(dplyr)
mydf <- data.frame(date = c(2014.25, 2014.41, 2014.53, 2016.3,
2016.7,2017.2,2018.5, 2019.35, 2019.8),
magnitude = round(rnorm(9, mean=0, sd=0.4),4))
mydf <- mydf %>% mutate(absmag = abs(magnitude))
mydf
> mydf
date magnitude absmag
1 2014.25 -0.1222 0.1222
2 2014.41 0.6047 0.6047
3 2014.53 0.1559 0.1559
4 2016.30 -0.2485 0.2485
5 2016.70 -0.8859 0.8859
6 2017.20 0.4500 0.4500
7 2018.50 -0.0180 0.0180
8 2019.35 -0.0065 0.0065
9 2019.80 0.3775 0.3775
Desired output:
date magnitude absmag
1 2014.41 0.6047 0.6047
2 2016.70 -0.8859 0.8859
3 2017.20 0.4500 0.4500
4 2018.50 -0.0180 0.0180
5 2019.80 0.3775 0.3775
The things I tried so far failed to incoropate the requirements in the bullet points.
This solution can only handle pairs of two items within 0.6 years from each other:
whichAreClose <- function(your.number, x, threshold = 0.6){
x[which(abs(x - your.number) != 0 & abs(x - your.number) < thresh)]}
out1 <- sapply(mydf$date,
FUN = whichAreClose,
x = mydf$date) %>%
unlist() %>%
split(., cut(seq_along(.), 2, labels = FALSE)) %>%
lapply(
., function(i){
mydf %>%
filter(date %in% i) %>%
slice_min(absmag)}) %>%
bind_rows(.) %>%
anti_join(mydf, .)
> out1
date magnitude absmag
1 2014.41 0.6047 0.6047
2 2014.53 0.1559 0.1559
3 2016.30 -0.2485 0.2485
4 2016.70 -0.8859 0.8859
5 2017.20 0.4500 0.4500
6 2018.50 -0.0180 0.0180
7 2019.80 0.3775 0.3775
and this solution cannot distinguish different pairs at all:
out2 <- mydf %>%
mutate(prevdist = abs(date - lag(date)),
nextdist = abs(date - lead(date)),
ispair = case_when(prevdist < 0.6 ~ 'yes',
nextdist < 0.6 ~ 'yes',
TRUE ~ 'no')) %>%
filter(ispair == 'yes') %>%
slice_min(absmag) %>%
anti_join(mydf, .)
> out2
date magnitude absmag
1 2014.25 -0.5883 0.5883
2 2014.41 -0.1913 0.1913
3 2014.53 0.1672 0.1672
4 2016.30 0.5435 0.5435
5 2017.20 0.1551 0.1551
6 2018.50 -0.0215 0.0215
7 2019.35 -0.5508 0.5508
8 2019.80 -0.1660 0.1660
P.S.: feel free to edit the title. I struggled to come up with a good one myself.
Upvotes: 4
Views: 268
Reputation: 3194
You can try to perform complete clustering on dates by using hclust
. The manhattan (i.e. absolute) distances are calculated between pairs of dates. The "complete" clustering method will ensure that every member of a cluster cut at h
height will be distant at most h
from the other members.
mydf |>
mutate(k = {
dist(date, method = "manhattan") |>
hclust(method = "complete") |>
cutree(h = .6)
})
#> date magnitude absmag k
#>1 2014.25 -0.2506 0.2506 1
#>2 2014.41 0.0735 0.0735 1
#>3 2014.53 -0.3343 0.3343 1
#>4 2016.30 0.6381 0.6381 2
#>5 2016.70 0.1318 0.1318 2
#>6 2017.20 -0.3282 0.3282 3
#>7 2018.50 0.1950 0.1950 4
#>8 2019.35 0.2953 0.2953 5
#>9 2019.80 0.2303 0.2303 5
Here is shown the dendrogram obtained:
mydf |>
mutate(k = {
dist(date, method = "manhattan") |>
hclust(method = "complete") |>
cutree(h = .6)
}) |>
group_by(k) |>
filter(absmag == max(absmag)) |>
as.data.frame()
#> date magnitude absmag k
#> 1 2014.53 -0.3343 0.3343 1
#> 2 2016.30 0.6381 0.6381 2
#> 3 2017.20 -0.3282 0.3282 3
#> 4 2018.50 0.1950 0.1950 4
#> 5 2019.35 0.2953 0.2953 5
Upvotes: 2
Reputation: 52049
With cumsum
and purrr::accumulate
.
library(tidyverse)
mydf %>%
group_by(cum = cumsum(accumulate(c(0, diff(date)), ~ifelse(.x + .y <= 0.6, .x + .y, 0)) == 0)) %>%
slice_max(absmag)
date magnitude absmag cum
1 2014.53 -0.3343 0.3343 1
2 2016.30 0.6381 0.6381 2
3 2017.20 -0.3282 0.3282 3
4 2018.50 0.1950 0.1950 4
5 2019.35 0.2953 0.2953 5
Explanation:
purrr:accumulate
takes the difference between the values of the vectors as input, and if the cumulative sum (.x + .y
) is below 0.6, it outputs the cumulative sum (for instance for the first three elements), but if it's higher than 0.6, the cumsum resets to 0. So for the fourth element, because 0.28 + 1.77 > 0.6
, the function sets the fourth element to 0
.
c(0, diff(mydf$date))
# [1] 0.00 0.16 0.12 1.77 0.40 0.50 1.30 0.85 0.45
accumulate(c(0, diff(mydf$date)), ~ifelse(.x + .y <= 0.6, .x + .y, 0))
# [1] 0.00 0.16 0.28 0.00 0.40 0.00 0.00 0.00 0.45
Everytime the cumsum resets to 0, it is actually a new group with subsequent rows being below the thresholds. So, to capture groups, one can use cumsum(... == 0)
:
cumsum(accumulate(c(0, diff(mydf$date)), ~ifelse(.x + .y <= 0.6, .x + .y, 0)) == 0)
# [1] 1 1 1 2 2 3 4 5 5
Upvotes: 1