Reputation: 9739
I'm learning R. I would like to aggregate data that looks like this:
ID Score Elements
231 123 "a,b,c"
132 123 "b,c,d"
321 123 "e"
645 123 "d, f"
321 200 "foo,bar,baz"
I would like to combine all rows that share at least one element, either directly or through a chain of other rows. The above would result in 3 rows, with only rows 1, 2, and 4 combining: rows 1 and 2 share b and c, and rows 2 and 4 share d. There may be hundreds of rows that must be combined, and some rows may be identical.
Output should be:
Score Elements
123 "a,b,c,d,f"
123 "e"
200 "foo,bar,baz"
I'm currently using aggregate on the Score column because it is almost a good group identifier; however, in some corner cases there are separate groups with the same score ('e' in the above). I'm using a custom combiner function to combine the Elements values into string vectors. My current code is:
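# Collapse all strings in a group, split on commas, and keep unique elements
customCombiner <- function(foo) {
  return(unique(unlist(strsplit(paste(as.vector(foo), collapse = ","), ','))))
}
# Column names are case-sensitive: the data uses Score, not score
result = aggregate(
  myDataFrame$Elements,
  by=list(Score=myDataFrame$Score),
  customCombiner
)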
Is it possible to aggregate rows but first check if they should be aggregated? Or a different solution to my problem?
Upvotes: 2
Views: 1190
Reputation: 9687
This is actually a tricky problem; you will need to find the connected components of an undirected graph, where each row of your data is a node and two nodes are joined by an edge if their element sets overlap.
First, you should get your data cleaned up using strsplit(), which will give you a list of sets, similar to this:
m <- list(c('a','b','c'), c('b','c','d'), 'e', c('d', 'f'), c('foo','baz','bar'))
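One way to get from the question's data frame to a list like m is sketched below. The data frame is rebuilt here only for illustration, with column names assumed from the question, and the split pattern tolerates whitespace around commas so that "d, f" does not produce " f". Note that the third element of the last row comes out in the data's own order (foo, bar, baz).

```r
# Example data mirroring the question (column names are assumed).
myDataFrame <- data.frame(
  ID = c(231, 132, 321, 645, 321),
  Score = c(123, 123, 123, 123, 200),
  Elements = c("a,b,c", "b,c,d", "e", "d, f", "foo,bar,baz"),
  stringsAsFactors = FALSE
)

# Split on commas, swallowing optional whitespace around each comma,
# after trimming whitespace at both ends of every string.
m <- strsplit(trimws(myDataFrame$Elements), "\\s*,\\s*")

# Drop duplicate elements within a single row, if any.
m <- lapply(m, unique)
```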
Then, you can calculate an adjacency matrix using outer and intersect:
adj <- outer(m,m,Vectorize(function(x,y) length(intersect(x,y))))
which is this matrix:
> adj
     [,1] [,2] [,3] [,4] [,5]
[1,]    3    2    0    0    0
[2,]    2    3    0    1    0
[3,]    0    0    1    0    0
[4,]    0    1    0    2    0
[5,]    0    0    0    0    3
Then, using the igraph package, convert the matrix to a graph and extract the components:
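library(igraph)
# graph_from_adjacency_matrix() is the current name for graph.adjacency()
cmp <- components(graph_from_adjacency_matrix(adj, mode = "undirected"))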
cmp$membership is the assignment of each node to a component:
> cmp$membership
[1] 1 1 2 1 3
You can, for example, find all the elements of each component using tapply:
> tapply(m, cmp$membership, Reduce, f=union)
$`1`
[1] "a" "b" "c" "d" "f"
$`2`
[1] "e"
$`3`
[1] "foo" "baz" "bar"
Upvotes: 1