Reputation: 5
I'm seeing something strange when computing the intersection of two networks/graphs: some nodes are missing from the result, and I'd like to understand why.
I'm working with Python 3.5.2 and pandas 0.17.1 on Linux Mint 18. The dataset and code to reproduce the problem are available at the link: Dataset and code
Both tables (Test_01.ncol and Test_02.ncol, attached in the link) are edge lists.
First I tried to get the intersection of the two graph tables with the pandas merge function:
import pandas as pd
# Load graphs
test_01 = pd.read_csv("Test_01.ncol",sep=" ") # Load Net 1
test_02 = pd.read_csv("Test_02.ncol",sep=" ") # Load Net 2
pandas_intersect = pd.merge(test_01, test_02, how='inner', on=['i1', 'i2']) # Intersection by column
pandas_nodes = len(set(pandas_intersect['i1'].tolist() + pandas_intersect['i2'].tolist())) # Store the number of nodes
Then, to check that the merge had worked correctly, I compared the resulting number of nodes with the number of nodes in the NetworkX intersection:
# Now test with NetworkX
import networkx as nx
n1 = nx.from_pandas_dataframe(test_01, source="i1", target="i2") # Transform net 1 in NetworkX Graph
n2 = nx.from_pandas_dataframe(test_02, source="i1", target="i2") # Transform net 2 in NetworkX Graph
fn = nx.intersection(n1,n2) # NetworkX Intersection
networkx_nodes = len(fn.nodes()) # Store the number of nodes
# The number of nodes are different!!!
pandas_nodes == networkx_nodes
I thought it might have something to do with the order of the nodes, which is not canonical in the attached tables, but even when I put the two datasets in canonical order there are missing nodes.
My next hypothesis was that it might be a bug in pandas or NetworkX, so I tried the same thing in R (version 3.3.2) with igraph (version 1.0.1):
library("igraph")
# Read Tables
g1 <- read.table("Test_01.ncol",header=TRUE)
g2 <- read.table("Test_02.ncol",header=TRUE)
# Transform Tables in Graphs
g1 <- graph_from_data_frame(g1, directed=FALSE)
g2 <- graph_from_data_frame(g2, directed=FALSE)
# Create igraph intersection
gi <- graph.intersection(g1,g2)
# Save graph intersection
write.graph(gi,"Test_igraph_intersection.ncol", format="ncol")
# Reload graph intersection
gi_r <- read.graph("Test_igraph_intersection.ncol",format="ncol")
# Prepare result summary
Methods <- c("igraph_intersection","pandas_table_intersection")
Vertex_counts <- c(vcount(gi),vcount(gi_r))
Edge_counts <- c(ecount(gi),ecount(gi_r))
# Create Summary Table
info_data = data.frame(Methods, Vertex_counts, Edge_counts)
colnames(info_data) <- c("Method","Vertices","Edges")
# Check info_data
info_data
But when I looked at info_data, I saw the same discrepancy.
I understand that the number of nodes may decrease because of the intersection itself, but why does it decrease only after I convert the result back to table form in Python, or after I save the file and reload it with igraph? Or am I doing something wrong?
If someone can explain what is happening in Python or R, I would appreciate it. I really need to understand why this happens and whether I can trust these intersections to continue my work.
Upvotes: 0
Views: 548
Reputation: 3739
The reason is that the graphs are undirected. intersection in igraph and networkx treats an I--J tie and a J--I tie as equivalent. A pandas merge will only match exact rows (i.e. column 1 in data frame A matches column 1 in data frame B and column 2 in data frame A matches column 2 in data frame B).
library(igraph); library(dplyr)
set.seed(1034)
g1 <- sample_gnp(20, 0.25, directed = F)
set.seed(1646)
g2 <- sample_gnp(20, 0.25, directed = F)
V(g1)$name <- sample(LETTERS, 20)
V(g2)$name <- sample(LETTERS, 20)
g1_el <- as.data.frame(as_edgelist(g1), stringsAsFactors = F)
g2_el <- as.data.frame(as_edgelist(g2), stringsAsFactors = F)
g1g2_inter <- as.data.frame(as_edgelist(intersection(g1,g2)))
ij <- inner_join(g1_el, g2_el)
At this point, the two data frames show differing numbers of nodes:
> g1g2_inter
V1 V2
1 X E
2 J Y
3 N J
4 O F
5 H Y
6 T J
7 K N
8 K T
9 P F
10 Q N
> ij
V1 V2
1 T J
2 N J
3 J Y
4 X E
We can make the data frames equal by reversing the order of the columns in one data frame and using inner_join again. This picks up the J--I ties that were missed before. Then use full_join to combine the two partial intersections:
g1g2_fj <- g1_el %>%
rename(V1 = V2, V2 = V1) %>% # reverse the column order
inner_join(., g2_el) %>% rename(V1 = V2, V2 = V1) %>%
full_join(., ij) %>% # join with the other 'partial' intersection
arrange(V1, V2)
Now, the igraph intersection matches the fully joined partial intersection:
> g1g2_inter[order(g1g2_inter[,1]),] == g1g2_fj
V1 V2
5 TRUE TRUE
2 TRUE TRUE
7 TRUE TRUE
8 TRUE TRUE
3 TRUE TRUE
4 TRUE TRUE
9 TRUE TRUE
10 TRUE TRUE
6 TRUE TRUE
1 TRUE TRUE
In essence, yes, you can trust the intersection methods of networkx and igraph. They are doing something a bit different to deal with undirected ties.
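On the pandas side of the question, the same idea can be sketched by putting every edge into one fixed orientation before merging, so that an I--J row and a J--I row become identical. This is only a minimal sketch: canonical_edges is a hypothetical helper (not part of pandas), the two small tables are made-up stand-ins for Test_01.ncol and Test_02.ncol, and it assumes the node labels in i1 and i2 can be compared with each other (for example, all strings).

```python
import pandas as pd

def canonical_edges(df, source="i1", target="i2"):
    """Hypothetical helper: orient every edge so that source <= target."""
    keep = df[source] <= df[target]           # rows already in the chosen order
    lo = df[source].where(keep, df[target])   # smaller endpoint of each edge
    hi = df[target].where(keep, df[source])   # larger endpoint of each edge
    out = pd.DataFrame({source: lo, target: hi})
    return out.drop_duplicates().reset_index(drop=True)

# Made-up toy edge lists; the second one stores two shared edges reversed.
test_01 = pd.DataFrame({"i1": ["A", "B", "C"], "i2": ["B", "C", "D"]})
test_02 = pd.DataFrame({"i1": ["B", "C", "D"], "i2": ["A", "B", "E"]})

# Exact-row merge: misses A--B and B--C because they are stored as B--A and C--B.
naive = pd.merge(test_01, test_02, how="inner", on=["i1", "i2"])

# Merge after canonicalising both tables: treats the ties as undirected.
undirected = pd.merge(canonical_edges(test_01), canonical_edges(test_02),
                      how="inner", on=["i1", "i2"])

print(len(naive), len(undirected))
```

With these toy tables the exact-row merge finds no common edges, while the canonicalised merge finds A--B and B--C.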
Upvotes: 1