AnnanFay

Reputation: 9739

How to aggregate based on condition?

I'm learning R. I would like to aggregate data which looks like this:

ID   Score  Elements
231  123    "a,b,c"
132  123    "b,c,d"
321  123    "e"
645  123    "d, f"
321  200    "foo,bar,baz"

I would like to combine all rows that share an element. The above would result in 3 rows, with only rows 1, 2 and 4 combining: rows 1 and 2 share b and c, and rows 2 and 4 share d. There may be hundreds of rows that must be combined, and some rows may be identical.

Output should be:

Score  Elements
123    "a,b,c,d,f"
123    "e"
200    "foo,bar,baz"

I'm currently using aggregate on the Score column because it is almost a good group identifier. However, in some corner cases distinct groups share the same score ('e' in the above). I'm using a custom combiner function to combine the Elements values into string vectors. My current code is:

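# Merge a group's comma-separated strings into one vector of unique elements.
customCombiner <- function(foo) {
    unique(unlist(strsplit(paste(as.vector(foo), collapse = ","), ",")))
}
result <- aggregate(
    myDataFrame$Elements,
    by = list(score = myDataFrame$Score),  # group by the Score column
    customCombiner
)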

Is it possible to aggregate rows but first check whether they should be aggregated? Or is there a different solution to my problem?

Upvotes: 2

Views: 1190

Answers (1)

Neal Fultz

Reputation: 9687

This is actually a tricky problem: you need to find the connected components of an undirected graph in which each row of your data is a node, and two nodes are joined by an edge if their elements overlap at all.

First, you should get your data cleaned up using strsplit(), which will give you a list of sets, similar to this:

m <- list(c('a','b','c'), c('b','c','d'), 'e', c('d', 'f'), c('foo','baz','bar'))
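
As a rough sketch of that cleanup step, assuming the question's data sits in a data frame like the one rebuilt below (the data frame itself is a reconstruction for illustration), splitting on commas with optional surrounding whitespace also handles the "d, f" row:

```r
# Hypothetical reconstruction of the data frame from the question.
myDataFrame <- data.frame(
  ID       = c(231, 132, 321, 645, 321),
  Score    = c(123, 123, 123, 123, 200),
  Elements = c("a,b,c", "b,c,d", "e", "d, f", "foo,bar,baz"),
  stringsAsFactors = FALSE
)

# Split each string on commas, swallowing any whitespace around them,
# so "d, f" becomes c("d", "f") rather than c("d", " f").
m <- strsplit(myDataFrame$Elements, "\\s*,\\s*")
```

The result is a list with one character vector per row, which is the shape the following steps expect.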

Then, you can calculate an adjacency matrix using outer and intersect:

adj <- outer(m,m,Vectorize(function(x,y) length(intersect(x,y))))

which is this matrix:

> adj
     [,1] [,2] [,3] [,4] [,5]
[1,]    3    2    0    0    0
[2,]    2    3    0    1    0
[3,]    0    0    1    0    0
[4,]    0    1    0    2    0
[5,]    0    0    0    0    3

Then, using the igraph package, convert the matrix to a graph and extract the components:

library(igraph)
cmp <- components(graph_from_adjacency_matrix(adj, mode = "undirected"))

cmp$membership is the assignment of each node to a component:

> cmp$membership
[1] 1 1 2 1 3
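
If you want to see which rows ended up together, one possible way is to split the row indices by membership. The sketch below hard-codes the membership vector printed above so it runs on its own; in practice you would use cmp$membership directly.

```r
# Membership vector copied from the output above (stand-in for cmp$membership).
membership <- c(1, 1, 2, 1, 3)

# Group the row indices by component label.
rows_by_component <- split(seq_along(membership), membership)
```

Here component 1 holds rows 1, 2 and 4, which are the rows that should be merged.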

You can, for example, find all the elements of a component using tapply:

> tapply(m, cmp$membership, Reduce, f=union)
$`1`
[1] "a" "b" "c" "d" "f"

$`2`
[1] "e"

$`3`
[1] "foo" "baz" "bar"
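
To get back to the Score/Elements layout the question asks for, something along these lines might work. It is only a sketch: the inputs are hard-coded copies of m, the membership vector and the Score column so the block is self-contained, and it assumes that all rows in a component carry the same score, which the question suggests but does not guarantee.

```r
# Stand-ins for the objects built above; in real use take them from
# strsplit(), cmp$membership and myDataFrame$Score.
m <- list(c("a", "b", "c"), c("b", "c", "d"), "e", c("d", "f"),
          c("foo", "baz", "bar"))
membership <- c(1, 1, 2, 1, 3)
score <- c(123, 123, 123, 123, 200)

# Union the element sets of each component and collapse to one string.
elements <- vapply(
  split(m, membership),
  function(sets) paste(Reduce(union, sets), collapse = ","),
  character(1)
)

# Assumption: scores agree within a component, so take the first one.
scores <- vapply(split(score, membership), function(s) s[1], numeric(1))

result <- data.frame(
  Score    = unname(scores),
  Elements = unname(elements),
  stringsAsFactors = FALSE
)
```

If scores can differ within a component, you would need to decide how to combine them (for example by keeping all of them) instead of taking the first.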

Upvotes: 1
