Reputation: 2652
I have a data set that looks like this
target.id source.id connected
1 1 0
2 1 0
3 1 0
4 1 0
5 1 0
6 1 0
1 2 1
2 2 0
3 2 1
Basically I have a source location, destination location and whether they are connected. Here connections are directional, e.g., location 1 can be connected to location 8 while location 8 is not connected to location 1 (think of airline flights, where Atlantis can send flights to Mars, while Mars may not send flights to Atlantis, which would mean Atlantis connects to Mars and Mars does not connect to Atlantis).
I need to identify sets of 'fully' connected locations, where all observations are sources and targets of each other. I need to do it pairwise, 3 by 3, and up to what is feasible, given that I have 75 locations. An example output would be, for 3 by 3, locations 3, 5, and 8 are both sources and targets of each other.
The way I have tried to tackle this was getting all permutations of 1:length(unique(target.id))
2 by 2, 3 by 3, until 8 by 8 (8 by 8 would be the maximum sets I would look at) and then intersect
all of them.
However, obviously, this is too slow. Any better approaches?
Upvotes: 3
Views: 82
Reputation: 44309
It sounds like you want all cliques of size 2 through 8 from a directed graph, where nodes are your ids and edges exist when the source -> destination is labeled as connected in your dataset. The first step would be filtering to just the connected edges, yielding something like the following example data:
(filtered <- data.frame(source.id = c(1, 1, 2, 2, 3, 3, 3, 4, 4), target.id = c(2, 3, 1, 3, 1, 2, 4, 3, 5), connected = 1))
# source.id target.id connected
# 1 1 2 1
# 2 1 3 1
# 3 2 1 1
# 4 2 3 1
# 5 3 1 1
# 6 3 2 1
# 7 3 4 1
# 8 4 3 1
# 9 4 5 1
Next, you can limit your data to id pairs that are connected in both directions:
(bidir <- filtered[duplicated(paste(pmin(filtered$source.id, filtered$target.id),
pmax(filtered$source.id, filtered$target.id))),])
# source.id target.id connected
# 3 2 1 1
# 5 3 1 1
# 6 3 2 1
# 8 4 3 1
In this sample data, the cliques of size 2 are (1, 2), (1, 3), (2, 3), and (3, 4), and the clique of size 3 is (1, 2, 3). The igraph package computes these in "near-optimal time":
library(igraph)
g <- graph.data.frame(bidir, directed=FALSE)
cliques(g, min=2, max=8)
# [[1]]
# + 2/4 vertices, named:
# [1] 2 3
#
# [[2]]
# + 2/4 vertices, named:
# [1] 2 1
#
# [[3]]
# + 2/4 vertices, named:
# [1] 3 4
#
# [[4]]
# + 2/4 vertices, named:
# [1] 3 1
#
# [[5]]
# + 3/4 vertices, named:
# [1] 2 3 1
Upvotes: 3