Jin_soo

Reputation: 105

Collapse rows by common variable of list

I want to collapse the rows of a data frame so that orthologs sharing any gene end up in one ortholog group, together with all of the corresponding genes.

For example:

Column A Column B
Ortho1 gene1
Ortho2 gene2, gene3
Ortho3 gene4, gene5, gene6
Ortho4 gene5, gene6
Ortho5 gene6, gene7
Ortho6 gene1, gene8

to become:

Column A Column B
Ortho1, Ortho6 gene1, gene8
Ortho2 gene2, gene3
Ortho3, Ortho4, Ortho5 gene4, gene5, gene6, gene7

I have tried to merge them, but merging requires an id column, which my data does not have. I also tried a for loop that uses intersect() to find shared genes. It feels like there should be a simpler way around this bottleneck.

Column A Column B
Ortho1 gene1
Ortho2 gene2
Ortho2 gene3

...

Upvotes: 1

Views: 45

Answers (1)

MrFlick

Reputation: 206197

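The data can be treated as a graph: every ortholog and every gene is a node, and each row contributes edges between an ortholog and its genes. The groups you want are then the connected (non-overlapping) components of that graph. One solution would be to use the igraph package to take care of finding them. You can do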

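library(igraph)
# dd is the data frame from the question
dd %>% 
  # one row per ortholog/gene pair, i.e. one edge per row
  tidyr::separate_rows(`Column B`) %>% 
  # build the graph and tag each vertex as "ortho" or "gene"
  graph_from_data_frame(vertices=rbind(
    data.frame(v=unique(.$`Column A`), type="ortho"), 
    data.frame(v=unique(.$`Column B`), type="gene"))) %>% 
  # split the graph into its connected components
  decompose() %>% 
  # collapse each component back into a single row
  purrr::map_df(function(g) {
    data.frame(
      "Column A" = paste((V(g)$name[V(g)$type=="ortho"]), collapse = ","),
      "Column B" = paste((V(g)$name[V(g)$type=="gene"]), collapse = ",")
    )
  })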

Which will return

              Column.A                Column.B
1        Ortho1,Ortho6             gene1,gene8
2               Ortho2             gene2,gene3
3 Ortho3,Ortho4,Ortho5 gene4,gene5,gene6,gene7
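
If you would rather not depend on igraph, the same connected-components idea can be sketched in base R. This is only an illustration, not the approach above: it rebuilds the example data from the question as `dd` (an assumption about how your data frame is laid out), and the helper names `genes`, `group` and `result` are made up for the example. It compares every pair of rows, so it is quadratic in the number of rows and probably only reasonable for small tables.

```r
# Example data from the question; the column layout is an assumption.
dd <- data.frame(
  `Column A` = paste0("Ortho", 1:6),
  `Column B` = c("gene1", "gene2, gene3", "gene4, gene5, gene6",
                 "gene5, gene6", "gene6, gene7", "gene1, gene8"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Split each row's gene string into a character vector.
genes <- strsplit(dd$`Column B`, ",\\s*")

# Start with every row in its own group, then merge the groups of
# any two rows that share at least one gene.
group <- seq_len(nrow(dd))
for (i in seq_along(genes)) {
  for (j in seq_len(i - 1)) {
    if (length(intersect(genes[[i]], genes[[j]])) > 0) {
      lo <- min(group[i], group[j])
      hi <- max(group[i], group[j])
      group[group == hi] <- lo  # relabel the whole group, not just one row
    }
  }
}

# Collapse each group into one row of orthologs and unique genes.
result <- do.call(rbind, lapply(split(seq_len(nrow(dd)), group), function(rows) {
  data.frame(
    `Column A` = paste(dd$`Column A`[rows], collapse = ", "),
    `Column B` = paste(unique(unlist(genes[rows])), collapse = ", "),
    check.names = FALSE,
    stringsAsFactors = FALSE
  )
}))
rownames(result) <- NULL
result
```

On the six example rows this should give the same three groups as the desired output in the question. Relabelling the whole group on each merge is what lets Ortho3, Ortho4 and Ortho5 end up together even when two of them are linked only through a third.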

Upvotes: 2
