Average rows of dataframe with duplicated values in more than n numerical columns

Question

I would like to be able to use dplyr to average rows in which there are identical values in ANY n or more numerical columns, and an identical value in the a column.

If:

n <- 3

and

df <- data.frame(a = c("one", "one", "one", "one", "three"),
                 b = c(1,1,1,2,3),
                 c = c(2,2,2,7,12),
                 d = c(6,6,7,8,10),
                 e = c(1,4,1,3,4))

then I would like the first three rows to be averaged (because 3 out of 4 numerical values are identical between them, and the value in a is also identical). I would NOT want row four to be included in the average, because although the value in a is identical, it has no identical numerical values.

Before:

    a      b   c   d   e   
[1] one    1   2   6   1
[2] one    1   2   6   4
[3] one    1   2   7   1
[4] one    2   7   8   3
[5] four   3  12  10   4

After:

    a      b   c    d   e   
[1] one    1   2   6.3  2
[2] one    2   7    8   3
[3] four   3  12   10   4

My data frame is much bigger in real life and contains plenty of other columns.

EDIT: Rows [1] and [2] have 3 identical values (in columns b, c and d. Rows [1] and [3] have 3 identical values (in columns b, c and e. This is why I want them averaged.

Stefano Barbi · Accepted Answer

Here, I first group the dataframe by the column a. Then, for each sub-dataframe, I calculate a distance matrix based on the number of different elements between the rows.

The package proxy is used because it allows easy calculation of a custom distance.

Then, I perform single-linkage clustering and cut the tree at an height just above 1. This will ensure that every member in a cluster will have at least (4 - 1) = 3 elements in common with at least another member of the same cluster.

Finally, I summarize each sub-dataframe by the cluster number gid.

library(dplyr)
library(tidyr)
library(proxy)

n <- 3

df <- data.frame(a = c("one", "one", "one", "one", "three"),
                 b = c(1,1,1,2,3),
                 c = c(2,2,2,7,12),
                 d = c(6,6,7,8,10),
                 e = c(1,4,1,3,4))



df |>
  group_by(a) |>
  group_modify(~{
    gid <- if(nrow(.x) > 1)
               proxy::dist(.x, method = \(a,b) sum(a != b)) |>
               hclust(method="single") |>
               cutree(h = 0.1 + ncol(.x) - n)
           else
             1
    group_by(cbind(.x, gid), gid) |>
      summarize(across(everything(), mean))
  })

# A tibble: 3 × 6
# Groups:   a [2]
  a       gid     b     c     d     e
       
1 one       1     1     2  6.33     2
2 one       2     2     7  8        3
3 three     1     3    12 10        4

Here is an example dendrogram obtained from the first 4 rows:

Average rows of dataframe with duplicated values in more than n numerical columns

Answers (1)

Related Questions