Collapse rows by common variable of list

Question

I want to collapse the rows of a data frame so that orthologs sharing any gene are merged into one ortholog group, listed together with all of their corresponding genes.

For example:

Column A	Column B
Ortho1	gene1
Ortho2	gene2, gene3
Ortho3	gene4, gene5, gene6
Ortho4	gene5, gene6
Ortho5	gene6, gene7
Ortho6	gene1, gene8

should become:

Column A	Column B
Ortho1, Ortho6	gene1, gene8
Ortho2	gene2, gene3
Ortho3, Ortho4, Ortho5	gene4, gene5, gene6, gene7

I have tried to merge the rows, but merging requires an id column, which my data does not provide. I also tried a for loop with intersect(). It feels like there should be a simpler way to get past this bottleneck.

The original data looked like this:

Column A	Column B
Ortho1	gene1
Ortho2	gene2
Ortho2	gene3

...

MrFlick · Accepted Answer

The data can be treated as a graph, with orthologs and genes as nodes and each ortholog-gene pair as an edge connecting them. The non-overlapping groups you want are then the connected components of that graph, and the igraph package can find them for you. You can do:

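library(igraph)

# dd is the data frame shown in the question
dd %>% 
  # one row per ortholog-gene pair, so that each row becomes an edge
  tidyr::separate_rows(`Column B`) %>% 
  # build the graph, tagging every vertex as an ortholog or a gene
  graph_from_data_frame(vertices=rbind(
    data.frame(v=unique(.$`Column A`), type="ortho"), 
    data.frame(v=unique(.$`Column B`), type="gene"))) %>% 
  # split the graph into its connected components
  decompose() %>% 
  # collapse each component back into a single row
  purrr::map_df(function(g) {
    data.frame(
      "Column A" = paste((V(g)$name[V(g)$type=="ortho"]), collapse = ","),
      "Column B" = paste((V(g)$name[V(g)$type=="gene"]), collapse = ",")
    )
  })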

This returns:

              Column.A                Column.B
1        Ortho1,Ortho6             gene1,gene8
2               Ortho2             gene2,gene3
3 Ortho3,Ortho4,Ortho5 gene4,gene5,gene6,gene7
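
Since the question mentions that the original data had one ortholog-gene pair per row, a variant that starts from that long format may also be useful. The following is only a sketch under some assumptions: the data frame `long` and its column names `ortho` and `gene` are made up for illustration, the pairs are filled in from the example table in the question, and ortholog names are assumed never to coincide with gene names. It uses `components()` to label each vertex with its group instead of `decompose()`.

```r
library(igraph)

# Hypothetical long-format input: one ortholog-gene pair per row
long <- data.frame(
  ortho = c("Ortho1", "Ortho2", "Ortho2", "Ortho3", "Ortho3", "Ortho3",
            "Ortho4", "Ortho4", "Ortho5", "Ortho5", "Ortho6", "Ortho6"),
  gene  = c("gene1", "gene2", "gene3", "gene4", "gene5", "gene6",
            "gene5", "gene6", "gene6", "gene7", "gene1", "gene8")
)

# Each row is an edge between an ortholog and a gene
g <- graph_from_data_frame(long, directed = FALSE)

# Component id of every vertex, in the same order as V(g)$name
membership   <- components(g)$membership
vertex_names <- V(g)$name

# Assumption: a vertex is an ortholog if its name appears in long$ortho
is_ortho <- vertex_names %in% long$ortho

# Paste together the selected vertex names within each component
collapse_by <- function(keep) {
  vapply(split(vertex_names[keep], membership[keep]),
         paste, character(1), collapse = ",")
}

result <- data.frame(
  orthos = collapse_by(is_ortho),
  genes  = collapse_by(!is_ortho),
  row.names = NULL
)
result
```

Because every vertex comes from an ortholog-gene pair, each component contains at least one ortholog and one gene, so the two collapsed vectors line up row by row.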
