Reputation: 11
I'm currently working on analyzing networks using the igraph package in R (answers in Python are also welcomed). I have several possibly cyclic directed unweighted graphs (no duplicated edges), each containing 1,000 to 1,000,000 of nodes. In each graph, I have identified starting nodes (nodes without incoming edges) and ending nodes (nodes without outgoing edges), and I have already identified all connected start-end pairs.
For example,
Starting nodes are: A and E Ending nodes are: C and F Connected start-end pairs are: (A, C) (A, F) (E, F)
I've successfully obtained the shortest path lengths between each start-end pair using the shortest.paths function in the igraph package. The path length is defined as the number of edges it goes through.
Using the example above, I have:
start_name | end_name | shortest path length
A | C | 2
A | F | 2
E | F | 1
Now, I'm interested in calculating the average simple path length and longest simple path length. My initial approach was to use the all_simple_paths function to
Given the example above, the result should be
start_name | end_name | shortest path length | longest path length | avg path length
A | C | 2 | 3 | 2.5
A | F | 2 | 2 | 2.0
E | F | 1 | 1 | 1.0
However, this method seems to be too time-consuming, especially given the large number of nodes in each graph. I'm concerned about the feasibility of this approach as I've read that the time grows exponentially with the number of nodes. (I'm not a computer science student so I have little knowledge about time complexity).
Therefore, I'm seeking advice on whether this project is feasible and, if so, what would be the expected time for computation? (I know it's nearly impossible to finished in couple of hours. Will it take weeks even months?)
Also, I would appreciate any alternative approaches or optimizations that could help me efficiently calculate the average and longest path lengths in these large directed graphs.
Here is the example code:
# Example code to identify starting and ending nodes
library(tidyverse)
library(igraph)
# Create a sample graph
g_data <- data.frame(from = c("A", "B", "B", "D", "B", "E"),
end = c("B", "C", "D", "C", "F", "F"))
g <- graph_from_data_frame(g_data, directed = TRUE)
V(g)
V(g)$name
plot(g)
# Identify starting nodes (nodes without incoming edges)
starting_nodes <- V(g)[which(degree(g, mode = "in") == 0)]
# Identify ending nodes (nodes without outgoing edges)
ending_nodes <- which(degree(g, mode = "out") == 0)
# Find all connected start-end pairs (vertex id, not name)
connected_pairs <- expand.grid(starting_nodes, ending_nodes)
colnames(connected_pairs) <- c("start", "end")
connected_pairs <- connected_pairs %>% mutate(start_name = V(g)$name[start],
end_name = V(g)$name[end])
# Calculate the shortest path
connected_pairs$shortest_path_length <- sapply(1:nrow(connected_pairs), function(i) {
distances(g, connected_pairs[i,]$start, connected_pairs[i,]$end, mode = "out")
})
# Delete unconncected pairs
connected_pairs <- connected_pairs[connected_pairs$shortest_path_length != Inf, ]
# Longest path length
connected_pairs$longest_path_length <- sapply(1:nrow(connected_pairs), function(i) {
all_paths <- all_simple_paths(g, connected_pairs[i,]$start, connected_pairs[i,]$end, mode = "out")
num_all_paths <- length(all_paths)
all_paths_length <- sapply(1:length(all_paths), function(x) {
length(all_paths[[x]]) - 1
})
return(max(all_paths_length))
})
# Average path length
connected_pairs$avg_path_length <- sapply(1:nrow(connected_pairs), function(i) {
all_paths <- all_simple_paths(g, connected_pairs[i,]$start, connected_pairs[i,]$end, mode = "out")
num_all_paths <- length(all_paths)
all_paths_length <- sapply(1:length(all_paths), function(x) {
length(all_paths[[x]]) - 1
})
return(mean(all_paths_length))
})
Upvotes: 1
Views: 230
Reputation: 101638
You can try the code below
transform(
aggregate(
path_len ~ .,
do.call(
rbind,
unlist(
lapply(
starting_nodes,
\(x) {
lapply(
all_simple_paths(g, x, ending_nodes),
\(p) {
data.frame(
start_name = V(g)$name[p[1]],
end_name = V(g)$name[p[length(p)]],
path_len = length(p) - 1
)
}
)
}
), FALSE, FALSE
)
),
range
),
avg_len = rowMeans(path_len)
)
which gives
start_name end_name path_len.1 path_len.2 avg_len
1 A C 2 3 2.5
2 A F 2 2 2.0
3 E F 1 1 1.0
Upvotes: 0