DN1

Reputation: 218

How to group rows based on connecting or duplicate information?

I have a genetic dataset of positions in the genome, and I want to group rows (genome positions) that are connected through shared or duplicate information. What I mean by this is:

If I have a dataset of points A, B, C etc.:

Point Connections
A       A, B
B       B, C
C       C, B
D       D, E, F, G

I want to group those which all have connections to each other (whether directly or not) by setting a matching group number column for those rows, so for example this dataset groups to:

Point Connections     Group
A       A, B            1
B       B, C            1
C       C, B            1 
D       D, E, F, G      2

# A, B and C are all connected to each other, so they are in the same group, even though
# A and C are not directly connected in the Connections column.
# D is the first unrelated row seen, so it goes in a separate group, which would also
# include D's connecting letters and any connectors of those letters.

My actual dataset consists of chromosome positions (CP), where the 1st number is the chromosome and the 2nd number (following the :) is a genome position on that chromosome. A sample looks like this (the real data has ~3000 rows):

CP        linked_CPS
1:100    1:100, 1:203
1:102    1:102
1:203    1:100, 1:203, 1:400
1:400    1:400
2:400    2:400, 2:401
2:401    2:401, 2:402

Expected output grouping connected rows:

CP        linked_CPS          Group
1:100    1:100, 1:203           1
1:203    1:100, 1:203, 1:400    1
1:400    1:400                  1
1:102    1:102                  2
2:400    2:400, 2:401           3
2:401    2:401, 2:402           3

One thing to note is that positions on different chromosomes (the leading 1: or 2: of CP) cannot be in the same group even if the 2nd number is the same, e.g. 1:400 and 2:400 would not be in the same group, as they are on chromosomes 1 and 2.

Also for context, my final aim is to take the smallest and largest position of each group to set a region distance per group in the genome.

I've seen other questions about grouping matching/duplicate information, but I haven't been sure how to apply them to this problem. I have a biology background, so I'm not sure which packages/functions are best. Any help would be appreciated.

Input data:

structure(list(CP = c("1:100", "1:102", "1:203", "1:400", "2:400", 
"2:401"), linked_CPS = c("1:100, 1:203", "1:102", "1:100, 1:203, 1:400", 
"1:400", "2:400, 2:401", "2:401, 2:402")), row.names = c(NA, 
-6L), class = c("data.table", "data.frame"))

Upvotes: 1

Views: 71

Answers (1)

Bas

Reputation: 4658

If I understand your question correctly, you are looking for connected components in a graph.

The code below turns your data.frame into a graph and finds these components.

library(tidyverse)
library(tidygraph)

df <- structure(
  list(
    CP = c("1:100", "1:102", "1:203", "1:400", "2:400", "2:401"),
    linked_CPS = c("1:100, 1:203", "1:102", "1:100, 1:203, 1:400",
                   "1:400", "2:400, 2:401", "2:401, 2:402")
  ),
  row.names = c(NA, -6L),
  class = c("data.table", "data.frame")
)

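df %>% 
  # one row per (CP, linked CP) pair, i.e. one row per edge
  separate_rows(linked_CPS, sep = ", ") %>% 
  # the first two columns are read as the two ends of each edge
  as_tbl_graph() %>% 
  activate(nodes) %>% 
  # label every node with the connected component it belongs to
  mutate(group = group_components()) %>% 
  as_tibble()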

which gives

# A tibble: 7 x 2
  name  group
  <chr> <int>
1 1:100     1
2 1:102     3
3 1:203     1
4 1:400     1
5 2:400     2
6 2:401     2
7 2:402     2
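
To get the group back onto the original rows, and the smallest and largest position per group that the question mentions as the final aim, one possible continuation is sketched below. It assumes a region should span every position in a component, including ones such as 2:402 that only appear in linked_CPS. The names groups, df_grouped and regions are made up for this example.

```r
library(tidyverse)
library(tidygraph)

df <- data.frame(
  CP = c("1:100", "1:102", "1:203", "1:400", "2:400", "2:401"),
  linked_CPS = c("1:100, 1:203", "1:102", "1:100, 1:203, 1:400",
                 "1:400", "2:400, 2:401", "2:401, 2:402")
)

# Node-level groups, computed as in the code above
groups <- df %>%
  separate_rows(linked_CPS, sep = ", ") %>%
  as_tbl_graph() %>%
  activate(nodes) %>%
  mutate(group = group_components()) %>%
  as_tibble()

# Attach the group to the original rows; nodes that never occur in the
# CP column (here 2:402) simply have no row to join to
df_grouped <- df %>%
  left_join(groups, by = c("CP" = "name"))

# Split "chromosome:position" and take the range per group
# (assumption: the range covers all nodes of the component)
regions <- groups %>%
  separate(name, into = c("chrom", "pos"), sep = ":", convert = TRUE) %>%
  group_by(group, chrom) %>%
  summarise(start = min(pos), end = max(pos), .groups = "drop")

df_grouped
regions
```

The group numbers are only labels. tidygraph orders them by component size, which is why the single-position group 1:102 ends up as 3 rather than 2 as in the expected output. If the regions should only cover positions in the CP column, running the same `separate()` and `summarise()` steps on df_grouped (with CP instead of name) should do that.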

Upvotes: 2
