saQuist

Reputation: 447

R: keep rows with the maximum value in one column when multiple rows have values close to each other in another column

I have a data frame with dates and magnitudes. Wherever dates fall within 0.6 years of each other, I want to keep the row with the highest absolute magnitude and discard the others.

data:

set.seed(1)
library(dplyr)
mydf <- data.frame(date = c(2014.25, 2014.41, 2014.53, 2016.3, 
                            2016.7,2017.2,2018.5, 2019.35, 2019.8),
                   magnitude = round(rnorm(9, mean=0, sd=0.4),4))
mydf <- mydf %>% mutate(absmag = abs(magnitude))
mydf
> mydf
     date magnitude absmag
1 2014.25   -0.2506 0.2506
2 2014.41    0.0735 0.0735
3 2014.53   -0.3343 0.3343
4 2016.30    0.6381 0.6381
5 2016.70    0.1318 0.1318
6 2017.20   -0.3282 0.3282
7 2018.50    0.1950 0.1950
8 2019.35    0.2953 0.2953
9 2019.80    0.2303 0.2303

Desired output:

     date magnitude absmag
1 2014.53   -0.3343 0.3343
2 2016.30    0.6381 0.6381
3 2017.20   -0.3282 0.3282
4 2018.50    0.1950 0.1950
5 2019.35    0.2953 0.2953

The things I tried so far failed to incorporate the requirements above.

This attempt can only handle pairs of dates that fall within 0.6 years of each other:

whichAreClose <- function(your.number, x, threshold = 0.6){
  # return the other values of x that lie within `threshold` of your.number
  x[which(abs(x - your.number) != 0 & abs(x - your.number) < threshold)]
}
out1 <- sapply(mydf$date, 
                FUN = whichAreClose, 
                x = mydf$date) %>% 
  unlist() %>% 
  split(., cut(seq_along(.), 2, labels = FALSE)) %>%  # hard-codes exactly two groups
  lapply(
    ., function(i){
      mydf %>% 
        filter(date %in% i) %>% 
        slice_min(absmag)}) %>% 
  bind_rows(.) %>% 
  anti_join(mydf, .)

> out1
     date magnitude absmag
1 2014.25   -0.2506 0.2506
2 2014.53   -0.3343 0.3343
3 2016.30    0.6381 0.6381
4 2017.20   -0.3282 0.3282
5 2018.50    0.1950 0.1950
6 2019.35    0.2953 0.2953
7 2019.80    0.2303 0.2303

and this attempt cannot tell the different pairs apart at all:

out2 <- mydf %>% 
  mutate(prevdist = abs(date - lag(date)),
         nextdist = abs(date - lead(date)),
         ispair = case_when(prevdist < 0.6 ~ 'yes',
                            nextdist < 0.6 ~ 'yes',
                            TRUE ~ 'no')) %>% 
  filter(ispair == 'yes') %>% 
  slice_min(absmag) %>%  # drops only the single smallest absmag across all rows
  anti_join(mydf, .)

> out2
     date magnitude absmag
1 2014.25   -0.2506 0.2506
2 2014.53   -0.3343 0.3343
3 2016.30    0.6381 0.6381
4 2016.70    0.1318 0.1318
5 2017.20   -0.3282 0.3282
6 2018.50    0.1950 0.1950
7 2019.35    0.2953 0.2953
8 2019.80    0.2303 0.2303

P.S.: feel free to edit the title. I struggled to come up with a good one myself.

Upvotes: 4

Views: 268

Answers (2)

Stefano Barbi

Reputation: 3194

You can try complete-linkage clustering of the dates using hclust. Manhattan (i.e. absolute) distances are computed between pairs of dates, and the "complete" linkage method ensures that every member of a cluster cut at height h is at most h away from every other member.

mydf |>
  mutate(k = {
    dist(date, method = "manhattan") |>
      hclust(method = "complete") |>
      cutree(h = .6)
  })

#>      date magnitude absmag k
#> 1 2014.25   -0.2506 0.2506 1
#> 2 2014.41    0.0735 0.0735 1
#> 3 2014.53   -0.3343 0.3343 1
#> 4 2016.30    0.6381 0.6381 2
#> 5 2016.70    0.1318 0.1318 2
#> 6 2017.20   -0.3282 0.3282 3
#> 7 2018.50    0.1950 0.1950 4
#> 8 2019.35    0.2953 0.2953 5
#> 9 2019.80    0.2303 0.2303 5

The resulting dendrogram is shown below:

dendrogram
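
To see why the "complete" linkage matters here, compare it with single linkage (a quick sketch on the same mydf, not part of the answer). Single linkage merges clusters through their closest members, so it chains 2016.3 → 2016.7 → 2017.2 into one cluster even though 2017.2 and 2016.3 are 0.9 years apart; complete linkage keeps every within-cluster distance at or below the cut height.

d <- dist(mydf$date, method = "manhattan")

# single linkage: nearby dates chain together even when the extremes are far apart
cutree(hclust(d, method = "single"), h = 0.6)
#> [1] 1 1 1 2 2 2 3 4 4

# complete linkage: no two dates in a cluster differ by more than 0.6
cutree(hclust(d, method = "complete"), h = 0.6)
#> [1] 1 1 1 2 2 3 4 5 5

Cutting at h = 0.6, grouping by cluster, and keeping the row with the largest absmag then gives the desired result: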

mydf |>
  mutate(k = {
    dist(date, method = "manhattan") |>
      hclust(method = "complete") |>
      cutree(h = .6)
  }) |>
  group_by(k) |>
  filter(absmag == max(absmag)) |>
  as.data.frame()

#>      date magnitude absmag k
#> 1 2014.53   -0.3343 0.3343 1
#> 2 2016.30    0.6381 0.6381 2
#> 3 2017.20   -0.3282 0.3282 3
#> 4 2018.50    0.1950 0.1950 4
#> 5 2019.35    0.2953 0.2953 5
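
Equivalently, the filter step can be written with dplyr's slice_max (a small variation, not part of the original answer):

mydf |>
  mutate(k = dist(date, method = "manhattan") |>
           hclust(method = "complete") |>
           cutree(h = 0.6)) |>
  group_by(k) |>
  slice_max(absmag, n = 1) |>  # one row per cluster (ties are kept by default)
  ungroup()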

Upvotes: 2

Maël

Reputation: 52049

With cumsum and purrr::accumulate.

library(tidyverse)

mydf %>% 
  group_by(cum = cumsum(accumulate(c(0, diff(date)),
                                   ~ ifelse(.x + .y <= 0.6, .x + .y, 0)) == 0)) %>% 
  slice_max(absmag)

     date magnitude absmag cum
1 2014.53   -0.3343 0.3343   1
2 2016.30    0.6381 0.6381   2
3 2017.20   -0.3282 0.3282   3
4 2018.50    0.1950 0.1950   4
5 2019.35    0.2953 0.2953   5

Explanation:

purrr::accumulate takes the successive differences between the dates as input. As long as the running sum (.x + .y) stays at or below 0.6, it outputs that running sum (as for the first three elements), but once it exceeds 0.6 the sum resets to 0. So the fourth element becomes 0, because 0.28 + 1.77 > 0.6.

c(0, diff(mydf$date))
# [1] 0.00 0.16 0.12 1.77 0.40 0.50 1.30 0.85 0.45

accumulate(c(0, diff(mydf$date)), ~ifelse(.x + .y <= 0.6, .x + .y, 0))
# [1] 0.00 0.16 0.28 0.00 0.40 0.00 0.00 0.00 0.45

Every time the running sum resets to 0, a new group starts; within a group, the cumulative gap stays below the threshold. So, to label the groups, one can use cumsum(... == 0):

cumsum(accumulate(c(0, diff(mydf$date)), ~ifelse(.x + .y <= 0.6, .x + .y, 0)) == 0)
# [1] 1 1 1 2 2 3 4 5 5
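
For reference, the same idea in base R (a sketch, not from the answer): Reduce() with accumulate = TRUE plays the role of purrr::accumulate, and the resulting groups match the hclust answer on this data.

gaps <- c(0, diff(mydf$date))

# running gap that resets to 0 once the cumulative distance exceeds 0.6
run <- Reduce(function(x, y) if (x + y <= 0.6) x + y else 0,
              gaps, accumulate = TRUE)
run
# [1] 0.00 0.16 0.28 0.00 0.40 0.00 0.00 0.00 0.45

cumsum(run == 0)
# [1] 1 1 1 2 2 3 4 5 5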

Upvotes: 1
