Reputation: 105
I want to collapse the rows of a data frame so that orthologs sharing any gene are merged into one ortholog group, together with all of their corresponding genes.
For example:
Column A | Column B |
---|---|
Ortho1 | gene1 |
Ortho2 | gene2, gene3 |
Ortho3 | gene4, gene5, gene6 |
Ortho4 | gene5, gene6 |
Ortho5 | gene6, gene7 |
Ortho6 | gene1, gene8 |
should become:
Column A | Column B |
---|---|
Ortho1, Ortho6 | gene1, gene8 |
Ortho2 | gene2, gene3 |
Ortho3, Ortho4, Ortho5 | gene4, gene5, gene6, gene7 |
I have tried to merge them; however, merge() requires an id column, which my data does not have. I also tried a for loop with intersect() to find the overlapping genes. It feels like there should be a simpler way to overcome this bottleneck.
Column A | Column B |
---|---|
Ortho1 | gene1 |
Ortho2 | gene2 |
Ortho2 | gene3 |
...
Upvotes: 1
Views: 45
Reputation: 206197
The data is similar to a graph, with nodes and edges connecting them. One solution would be to use the igraph package to find the non-overlapping groups, which are the connected components of that graph. You can do:
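library(igraph)

dd %>%
  # one row per (ortholog, gene) pair, i.e. one edge per row
  tidyr::separate_rows(`Column B`) %>%
  # tag each vertex as "ortho" or "gene" so they can be told apart later
  graph_from_data_frame(vertices = rbind(
    data.frame(v = unique(.$`Column A`), type = "ortho"),
    data.frame(v = unique(.$`Column B`), type = "gene"))) %>%
  # split the graph into its connected components
  decompose() %>%
  # one output row per component
  purrr::map_df(function(g) {
    data.frame(
      "Column A" = paste(V(g)$name[V(g)$type == "ortho"], collapse = ","),
      "Column B" = paste(V(g)$name[V(g)$type == "gene"], collapse = ",")
    )
  })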
Which will return
Column.A Column.B
1 Ortho1,Ortho6 gene1,gene8
2 Ortho2 gene2,gene3
3 Ortho3,Ortho4,Ortho5 gene4,gene5,gene6,gene7
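The snippet above assumes a data frame named dd shaped like the first table in the question. As a rough, self-contained sketch of the same idea, the example below builds such a dd by hand and labels each vertex with its connected component via components() instead of decompose(). The names edges, vnames, memb, and result are made up for illustration, and the sketch assumes that no gene name is identical to an ortholog name.

```r
library(igraph)

# Assumed input: the question's first table, genes as comma-separated strings.
dd <- data.frame(
  `Column A` = c("Ortho1", "Ortho2", "Ortho3", "Ortho4", "Ortho5", "Ortho6"),
  `Column B` = c("gene1", "gene2, gene3", "gene4, gene5, gene6",
                 "gene5, gene6", "gene6, gene7", "gene1, gene8"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# One row per (ortholog, gene) pair; each row becomes an edge.
edges <- tidyr::separate_rows(dd, `Column B`, sep = ",\\s*")

# Undirected graph; vertices are created from the names in the two columns.
g <- graph_from_data_frame(edges, directed = FALSE)

# Component id for every vertex, in the same order as V(g)$name.
vnames <- V(g)$name
memb <- as.integer(components(g)$membership)

# Assumption: ortholog and gene names never collide.
is_ortho <- vnames %in% edges$`Column A`

# Collapse the names in each component, orthologs and genes separately.
ortho_by_comp <- split(vnames[is_ortho], memb[is_ortho])
gene_by_comp <- split(vnames[!is_ortho], memb[!is_ortho])

result <- data.frame(
  orthologs = vapply(ortho_by_comp, paste, character(1), collapse = ", "),
  genes = vapply(gene_by_comp, paste, character(1), collapse = ", "),
  row.names = NULL
)
print(result)
```

On the example data this should produce the same three groups as the output above. Every edge joins one ortholog to one gene, so each component contains at least one of each and the two split() results line up row by row.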
Upvotes: 2