Reputation: 548
I would like to be able to use dplyr
to average rows in which there are identical values in ANY n
or more numerical columns, and an identical value in the a
column.
If:
n <- 3
and
df <- data.frame(a = c("one", "one", "one", "one", "three"),
b = c(1,1,1,2,3),
c = c(2,2,2,7,12),
d = c(6,6,7,8,10),
e = c(1,4,1,3,4))
then I would like the first three rows to be averaged (because 3 out of 4 numerical values are identical between them, and the value in a
is also identical). I would NOT want row four to be included in the average, because although the value in a
is identical, it has no identical numerical values.
Before:
a b c d e
[1] one 1 2 6 1
[2] one 1 2 6 4
[3] one 1 2 7 1
[4] one 2 7 8 3
[5] four 3 12 10 4
After:
a b c d e
[1] one 1 2 6.3 2
[2] one 2 7 8 3
[3] four 3 12 10 4
My data frame is much bigger in real life and contains plenty of other columns.
EDIT:
Rows [1]
and [2]
have 3 identical values (in columns b
, c
and d
. Rows [1]
and [3]
have 3 identical values (in columns b
, c
and e
. This is why I want them averaged.
Upvotes: 2
Views: 113
Reputation: 3184
Here, I first group the dataframe by the column a
. Then, for each sub-dataframe, I calculate a distance matrix based on the number of different elements between the rows.
The package proxy
is used because it allows easy calculation of a custom distance.
Then, I perform single-linkage clustering and cut the tree at an height just above 1. This will ensure that every member in a cluster will have at least (4 - 1) = 3
elements in common with at least another member of the same cluster.
Finally, I summarize each sub-dataframe by the cluster number gid
.
library(dplyr)
library(tidyr)
library(proxy)
n <- 3
df <- data.frame(a = c("one", "one", "one", "one", "three"),
b = c(1,1,1,2,3),
c = c(2,2,2,7,12),
d = c(6,6,7,8,10),
e = c(1,4,1,3,4))
df |>
group_by(a) |>
group_modify(~{
gid <- if(nrow(.x) > 1)
proxy::dist(.x, method = \(a,b) sum(a != b)) |>
hclust(method="single") |>
cutree(h = 0.1 + ncol(.x) - n)
else
1
group_by(cbind(.x, gid), gid) |>
summarize(across(everything(), mean))
})
# A tibble: 3 × 6
# Groups: a [2]
a gid b c d e
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 one 1 1 2 6.33 2
2 one 2 2 7 8 3
3 three 1 3 12 10 4
Here is an example dendrogram obtained from the first 4 rows:
Upvotes: 2