saQuist

Reputation: 447

R: keep rows with the maximum value in one column when multiple rows have values close to each other in another column

I have a data frame with dates and magnitudes. Wherever dates fall within 0.6 years of each other, I want to keep the row with the highest absolute magnitude and discard the others.

data:

set.seed(1)
library(dplyr)
mydf <- data.frame(date = c(2014.25, 2014.41, 2014.53, 2016.3, 
                            2016.7,2017.2,2018.5, 2019.35, 2019.8),
                   magnitude = round(rnorm(9, mean=0, sd=0.4),4))
mydf <- mydf %>% mutate(absmag = abs(magnitude))
mydf
> mydf
     date magnitude absmag
1 2014.25   -0.2506 0.2506
2 2014.41    0.0735 0.0735
3 2014.53   -0.3343 0.3343
4 2016.30    0.6381 0.6381
5 2016.70    0.1318 0.1318
6 2017.20   -0.3282 0.3282
7 2018.50    0.1950 0.1950
8 2019.35    0.2953 0.2953
9 2019.80    0.2303 0.2303

Desired output:

     date magnitude absmag
1 2014.53   -0.3343 0.3343
2 2016.30    0.6381 0.6381
3 2017.20   -0.3282 0.3282
4 2018.50    0.1950 0.1950
5 2019.35    0.2953 0.2953

The things I tried so far failed to incorporate the requirements above.

This attempt can only handle pairs of dates that fall within 0.6 years of each other:

whichAreClose <- function(your.number, x, threshold = 0.6){
  # return the other values of x that lie within `threshold` of your.number
  x[which(abs(x - your.number) != 0 & abs(x - your.number) < threshold)]
}
out1 <- sapply(mydf$date, 
                FUN = whichAreClose, 
                x = mydf$date) %>% 
  unlist() %>% 
  split(., cut(seq_along(.), 2, labels = FALSE)) %>%  # hard-codes exactly two groups
  lapply(
    ., function(i){
      mydf %>% 
        filter(date %in% i) %>% 
        slice_min(absmag)}) %>% 
  bind_rows(.) %>% 
  anti_join(mydf, .)

> out1
     date magnitude absmag
1 2014.25   -0.2506 0.2506
2 2014.53   -0.3343 0.3343
3 2016.30    0.6381 0.6381
4 2017.20   -0.3282 0.3282
5 2018.50    0.1950 0.1950
6 2019.35    0.2953 0.2953
7 2019.80    0.2303 0.2303

and this attempt cannot tell the different pairs apart at all:

out2 <- mydf %>% 
  mutate(prevdist = abs(date - lag(date)),
         nextdist = abs(date - lead(date)),
         ispair = case_when(prevdist < 0.6 ~ 'yes',
                            nextdist < 0.6 ~ 'yes',
                            TRUE ~ 'no')) %>% 
  filter(ispair == 'yes') %>% 
  slice_min(absmag) %>%  # drops only the single smallest absmag across all rows
  anti_join(mydf, .)

> out2
     date magnitude absmag
1 2014.25   -0.2506 0.2506
2 2014.53   -0.3343 0.3343
3 2016.30    0.6381 0.6381
4 2016.70    0.1318 0.1318
5 2017.20   -0.3282 0.3282
6 2018.50    0.1950 0.1950
7 2019.35    0.2953 0.2953
8 2019.80    0.2303 0.2303

P.S.: feel free to edit the title. I struggled to come up with a good one myself.

Upvotes: 4

Views: 268

Answers (2)

Stefano Barbi

Reputation: 3194

You can try complete-linkage clustering of the dates using hclust. Manhattan (i.e. absolute) distances are computed between pairs of dates, and the "complete" linkage method ensures that every member of a cluster cut at height h is at most h away from every other member.

mydf |>
  mutate(k = {
    dist(date, method = "manhattan") |>
      hclust(method = "complete") |>
      cutree(h = .6)
  })

#>      date magnitude absmag k
#> 1 2014.25   -0.2506 0.2506 1
#> 2 2014.41    0.0735 0.0735 1
#> 3 2014.53   -0.3343 0.3343 1
#> 4 2016.30    0.6381 0.6381 2
#> 5 2016.70    0.1318 0.1318 2
#> 6 2017.20   -0.3282 0.3282 3
#> 7 2018.50    0.1950 0.1950 4
#> 8 2019.35    0.2953 0.2953 5
#> 9 2019.80    0.2303 0.2303 5

The resulting dendrogram is shown below:

dendrogram
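
To see why the "complete" linkage matters here, compare it with single linkage (a quick sketch on the same mydf, not part of the answer). Single linkage merges clusters through their closest members, so it chains 2016.3 → 2016.7 → 2017.2 into one cluster even though 2017.2 and 2016.3 are 0.9 years apart; complete linkage keeps every within-cluster distance at or below the cut height.

d <- dist(mydf$date, method = "manhattan")

# single linkage: nearby dates chain together even when the extremes are far apart
cutree(hclust(d, method = "single"), h = 0.6)
#> [1] 1 1 1 2 2 2 3 4 4

# complete linkage: no two dates in a cluster differ by more than 0.6
cutree(hclust(d, method = "complete"), h = 0.6)
#> [1] 1 1 1 2 2 3 4 5 5

Cutting at h = 0.6, grouping by cluster, and keeping the row with the largest absmag then gives the desired result: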

mydf |>
  mutate(k = {
    dist(date, method = "manhattan") |>
      hclust(method = "complete") |>
      cutree(h = .6)
  }) |>
  group_by(k) |>
  filter(absmag == max(absmag)) |>
  as.data.frame()

#>      date magnitude absmag k
#> 1 2014.53   -0.3343 0.3343 1
#> 2 2016.30    0.6381 0.6381 2
#> 3 2017.20   -0.3282 0.3282 3
#> 4 2018.50    0.1950 0.1950 4
#> 5 2019.35    0.2953 0.2953 5
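
Equivalently, the filter step can be written with dplyr's slice_max (a small variation, not part of the original answer):

mydf |>
  mutate(k = dist(date, method = "manhattan") |>
           hclust(method = "complete") |>
           cutree(h = 0.6)) |>
  group_by(k) |>
  slice_max(absmag, n = 1) |>  # one row per cluster (ties are kept by default)
  ungroup()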

Upvotes: 2

Maël

Reputation: 52049

With cumsum and purrr::accumulate.

library(tidyverse)

mydf %>% 
  group_by(cum = cumsum(accumulate(c(0, diff(date)),
                                   ~ ifelse(.x + .y <= 0.6, .x + .y, 0)) == 0)) %>% 
  slice_max(absmag)

     date magnitude absmag cum
1 2014.53   -0.3343 0.3343   1
2 2016.30    0.6381 0.6381   2
3 2017.20   -0.3282 0.3282   3
4 2018.50    0.1950 0.1950   4
5 2019.35    0.2953 0.2953   5

Explanation:

purrr::accumulate takes the successive differences between the dates as input. As long as the running sum (.x + .y) stays at or below 0.6, it outputs that running sum (as for the first three elements), but once it exceeds 0.6 the sum resets to 0. So the fourth element becomes 0, because 0.28 + 1.77 > 0.6.

c(0, diff(mydf$date))
# [1] 0.00 0.16 0.12 1.77 0.40 0.50 1.30 0.85 0.45

accumulate(c(0, diff(mydf$date)), ~ifelse(.x + .y <= 0.6, .x + .y, 0))
# [1] 0.00 0.16 0.28 0.00 0.40 0.00 0.00 0.00 0.45

Every time the running sum resets to 0, a new group starts; within a group, the cumulative gap stays below the threshold. So, to label the groups, one can use cumsum(... == 0):

cumsum(accumulate(c(0, diff(mydf$date)), ~ifelse(.x + .y <= 0.6, .x + .y, 0)) == 0)
# [1] 1 1 1 2 2 3 4 5 5
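
For reference, the same idea in base R (a sketch, not from the answer): Reduce() with accumulate = TRUE plays the role of purrr::accumulate, and the resulting groups match the hclust answer on this data.

gaps <- c(0, diff(mydf$date))

# running gap that resets to 0 once the cumulative distance exceeds 0.6
run <- Reduce(function(x, y) if (x + y <= 0.6) x + y else 0,
              gaps, accumulate = TRUE)
run
# [1] 0.00 0.16 0.28 0.00 0.40 0.00 0.00 0.00 0.45

cumsum(run == 0)
# [1] 1 1 1 2 2 3 4 5 5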

Upvotes: 1
