Reputation: 218
I have a genetic dataset of positions in the genome, and I am looking to group rows (genome positions) in this dataset based on connected duplicate information. What I mean by this is:
If I have a dataset of points A, B, C etc.:
Point Connections
A A, B
B B, C
C C, B
D D, E, F, G
I want to group those which all have connections to each other (whether directly or not) by setting a matching group number column for those rows, so for example this dataset groups to:
Point Connections Group
A A, B 1
B B, C 1
C C, B 1
D D, E, F, G 2
# A, B and C are all connected to each other, so they are in the same group, even though
# A and C are not directly connected in the Connections column
# D is the first row seen that is unrelated, so it is put in a separate group, which would
# also include D's connecting letters and any connectors of those letters
A sample of my actual dataset is chromosome positions (CP), where the 1st number is the chromosome and the 2nd number (following a :) is a genome position on that chromosome, so it looks like this (real data is ~3000 rows):
CP linked_CPS
1:100 1:100, 1:203
1:102 1:102
1:203 1:100, 1:203, 1:400
1:400 1:400
2:400 2:400, 2:401
2:401 2:401, 2:402
Expected output grouping connected rows:
CP linked_CPS Group
1:100 1:100, 1:203 1
1:203 1:100, 1:203, 1:400 1
1:400 1:400 1
1:102 1:102 2
2:400 2:400, 2:401 3
2:401 2:401, 2:402 3
One thing to note is that positions on different chromosomes (the leading number, 1: or 2:, of CP) cannot be in the same group even if the 2nd number is the same, e.g. 1:400 and 2:400 would not be in the same group as they are on chromosomes 1 and 2.
Also for context, my final aim is to take the smallest and largest position of each group to set a region distance per group in the genome.
I've seen other questions with a similar basis of grouping matching/duplicate information, but I haven't been sure how to apply them to this problem, and I have a biology background so I am not sure which packages/functions are best. Any help would be appreciated.
Input data:
structure(list(CP = c("1:100", "1:102", "1:203", "1:400", "2:400",
"2:401"), linked_CPS = c("1:100, 1:203", "1:102", "1:100, 1:203, 1:400",
"1:400", "2:400, 2:401", "2:401, 2:402")), row.names = c(NA,
-6L), class = c("data.table", "data.frame"))
Upvotes: 1
Views: 71
Reputation: 4658
If I understand your question correctly, you are looking for connected components in a graph.
The code below turns your data.frame into a graph and finds these components.
library(tidyverse)
library(tidygraph)
df <- structure(list(CP = c("1:100", "1:102", "1:203", "1:400", "2:400",
"2:401"), linked_CPS = c("1:100, 1:203", "1:102", "1:100, 1:203, 1:400",
"1:400", "2:400, 2:401", "2:401, 2:402")), row.names = c(NA,
-6L), class = c("data.table", "data.frame"))
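# Split the comma-separated links so that each row is one CP -> linked CP edge
df %>%
  separate_rows(linked_CPS, sep = ", ") %>%
  # Treat the two columns as an edge list and build the graph
  as_tbl_graph() %>%
  activate(nodes) %>%
  # Label each node with the id of its connected component
  mutate(group = group_components()) %>%
  as_tibble()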
which gives
# A tibble: 7 x 2
name group
<chr> <int>
1 1:100 1
2 1:102 3
3 1:203 1
4 1:400 1
5 2:400 2
6 2:401 2
7 2:402 2
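To get from this node table back to your original rows with a Group column, and then to the smallest and largest position per group (the final aim you mention), something along these lines might work. This is only a sketch: the names `groups`, `result`, `regions`, `chrom`, `pos`, `start` and `end` are made up here for illustration, and it assumes every position has the form chromosome:position. It has not been tried on the full ~3000-row data.

```r
library(tidyverse)
library(tidygraph)

df <- data.frame(
  CP = c("1:100", "1:102", "1:203", "1:400", "2:400", "2:401"),
  linked_CPS = c("1:100, 1:203", "1:102", "1:100, 1:203, 1:400",
                 "1:400", "2:400, 2:401", "2:401, 2:402")
)

# Same pipeline as above: one row per position with its component id
groups <- df %>%
  separate_rows(linked_CPS, sep = ", ") %>%
  as_tbl_graph() %>%
  activate(nodes) %>%
  mutate(group = group_components()) %>%
  as_tibble()

# Attach the component id to the original rows
# (positions that only appear in linked_CPS, such as 2:402, are not rows here)
result <- df %>%
  left_join(groups, by = c("CP" = "name"))

# Split each position into chromosome and numeric position,
# then take the smallest and largest position per group.
# This uses every node, so linked-only positions such as 2:402 count too;
# start from `result` (column CP) instead to restrict it to the CP rows.
# Grouping by chrom as well keeps chromosomes in separate summary rows.
regions <- groups %>%
  separate(name, into = c("chrom", "pos"), sep = ":", convert = TRUE) %>%
  group_by(group, chrom) %>%
  summarise(start = min(pos), end = max(pos), .groups = "drop")
```

Here `result` has one row per original CP plus its group, and `regions` has one row per group with the start and end of the region. The group numbers themselves are arbitrary labels, so they may not match the numbering in your expected output.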
Upvotes: 2