Why are there missing nodes after graph intersection - NetworkX, igraph, Python and R

Question

I'm seeing something strange when computing the intersection of two networks/graphs. Nodes are missing from the resulting intersection, and I would like to understand why.

I'm working with Python 3.5.2 / pandas 0.17.1 on Linux Mint 18. The dataset and code to reproduce the problem are at this link: Dataset and code

Both tables (Test_01.ncol and Test_02.ncol attached in the link) are edge lists.

First I compute the intersection of the two edge tables with pandas, using the merge function:

import pandas as pd

# Load graphs
test_01 = pd.read_csv("Test_01.ncol",sep=" ") # Load Net 1
test_02 = pd.read_csv("Test_02.ncol",sep=" ") # Load Net 2
pandas_intersect = pd.merge(test_01, test_02, how='inner', on=['i1', 'i2']) # Intersection by column

pandas_nodes = len(set(pandas_intersect['i1'].tolist() + pandas_intersect['i2'].tolist())) # Store the number of nodes

Then, to check that the merge worked, I compared the resulting number of nodes with the number of nodes in the NetworkX intersection:

# Now test with NetworkX
import networkx as nx
n1 = nx.from_pandas_dataframe(test_01, source="i1", target="i2") # Transform net 1 in NetworkX Graph
n2 = nx.from_pandas_dataframe(test_02, source="i1", target="i2") # Transform net 2 in NetworkX Graph
fn = nx.intersection(n1,n2)  # NetworkX Intersection

networkx_nodes = len(fn.nodes()) # Store the number of nodes

# The node counts are different!
pandas_nodes == networkx_nodes

I thought it might be caused by the order of the nodes, which is not canonical in the attached tables, but even when I put both datasets in canonical order, nodes are still missing.

My next hypothesis was that it might be a bug in pandas or NetworkX, so I tried it in R (version 3.3.2) with igraph (version 1.0.1):

library("igraph")

# Read Tables
g1 <- read.table("Test_01.ncol",header=TRUE)
g2 <- read.table("Test_02.ncol",header=TRUE)

# Transform Tables in Graphs
g1 <- graph_from_data_frame(g1, directed=FALSE)
g2 <- graph_from_data_frame(g2, directed=FALSE)

# Create igraph intersection
gi <- graph.intersection(g1,g2)

# Save graph intersection
write.graph(gi,"Test_igraph_intersection.ncol", format="ncol")

# Reload graph intersection
gi_r <- read.graph("Test_igraph_intersection.ncol",format="ncol")

# Prepare result summary
Methods <- c("igraph_intersection","pandas_table_intersection")
Vertex_counts <- c(vcount(gi),vcount(gi_r))
Edge_counts <- c(ecount(gi),ecount(gi_r))

# Create Summary Table
info_data = data.frame(Methods, Vertex_counts, Edge_counts)
colnames(info_data) <- c("Method","Vertices","Edges")

# Check info_data
info_data

But when I looked at info_data, the outcome was the same as in Python: the vertex counts differ.

I understand that the number of nodes may decrease because of the intersection itself, but why does it happen only after I convert the result back to table format in Python, or after I save the file and load it again with igraph? Or am I doing something wrong?

If someone can explain what's happening in Python or R, I would appreciate it. I really need to understand why this happens and whether I can trust these intersections to continue my work.

paqmo · Accepted Answer

The reason is that the graphs are undirected. intersection in igraph and networkx treats an I--J tie and a J--I tie as equivalent. A pandas merge only matches exact rows (i.e. column 1 in data frame A matches column 1 in data frame B and column 2 in data frame A matches column 2 in data frame B).
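As a small illustration of that difference, here is a sketch in plain Python. The two edge lists are made-up toy data (not the question's files), and tuples versus frozensets stand in for "ordered" versus "undirected" matching:

```python
# Toy undirected edge lists; the labels are illustrative only.
edges_a = [("A", "B"), ("C", "D"), ("E", "F")]
edges_b = [("B", "A"), ("C", "D"), ("F", "G")]

# Ordered matching: a pair only matches if both positions agree,
# which is how a column-wise inner merge compares rows.
ordered_common = set(edges_a) & set(edges_b)

# Unordered matching: I--J and J--I are the same tie,
# which is how an undirected graph intersection compares edges.
unordered_common = {frozenset(e) for e in edges_a} & {frozenset(e) for e in edges_b}


def node_count(edge_set):
    """Count the distinct endpoints across a collection of edges."""
    return len({node for edge in edge_set for node in edge})


print(len(ordered_common), node_count(ordered_common))      # 1 edge, 2 nodes
print(len(unordered_common), node_count(unordered_common))  # 2 edges, 4 nodes
```

The A--B tie is stored as B--A in the second list, so ordered matching drops it along with its two nodes, while unordered matching keeps it.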

library(igraph); library(dplyr)
set.seed(1034)
g1 <- sample_gnp(20, 0.25, directed = F)
set.seed(1646)
g2 <- sample_gnp(20, 0.25, directed = F)
V(g1)$name <- sample(LETTERS, 20)
V(g2)$name <- sample(LETTERS, 20)

g1_el <- as.data.frame(as_edgelist(g1), stringsAsFactors = F)
g2_el <- as.data.frame(as_edgelist(g2), stringsAsFactors = F)
g1g2_inter <- as.data.frame(as_edgelist(intersection(g1,g2)))
ij <- inner_join(g1_el, g2_el)

At this point, the two data frames show differing numbers of nodes:

> g1g2_inter
   V1 V2
1   X  E
2   J  Y
3   N  J
4   O  F
5   H  Y
6   T  J
7   K  N
8   K  T
9   P  F
10  Q  N

> ij
  V1 V2
1  T  J
2  N  J
3  J  Y
4  X  E

We can make the data frames equal by reversing the order of the columns in one data frame and using inner_join again. This gets the J--I ties that were missed before. Then full_join the two partial intersections:

g1g2_fj <- g1_el %>% 
      rename(V1 = V2, V2 = V1) %>% # reverse the column order
      inner_join(., g2_el) %>% rename(V1 = V2, V2 = V1) %>% 
      full_join(., ij) %>%  #join with other 'partial' intersection 
      arrange(V1, V2)

Now, the igraph intersection matches the fully joined partial intersection:

> g1g2_inter[order(g1g2_inter[,1]),] == g1g2_fj
     V1   V2
5  TRUE TRUE
2  TRUE TRUE
7  TRUE TRUE
8  TRUE TRUE
3  TRUE TRUE
4  TRUE TRUE
9  TRUE TRUE
10 TRUE TRUE
6  TRUE TRUE
1  TRUE TRUE

In essence, yes you can trust the intersection methods of networkx and igraph. They are doing something a bit different to deal with undirected ties.
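For the pandas side of the question, the same idea can be sketched as follows, assuming the mismatch comes from edge orientation as described above. The helper canonical_edges is a hypothetical name introduced here, the two data frames are toy stand-ins for Test_01.ncol and Test_02.ncol, and any extra columns (such as weights) are dropped by this simplified version:

```python
import pandas as pd


def canonical_edges(df, src="i1", dst="i2"):
    """Hypothetical helper: sort each pair so I--J and J--I become identical rows."""
    pairs = [tuple(sorted(pair)) for pair in zip(df[src], df[dst])]
    return pd.DataFrame(pairs, columns=[src, dst]).drop_duplicates()


# Toy data standing in for the real edge lists (not the question's files).
test_01 = pd.DataFrame({"i1": ["A", "C", "E"], "i2": ["B", "D", "F"]})
test_02 = pd.DataFrame({"i1": ["B", "C", "F"], "i2": ["A", "D", "G"]})

# Plain merge: only rows whose columns match position by position survive.
naive = pd.merge(test_01, test_02, how="inner", on=["i1", "i2"])

# Merge after putting both tables in the same per-row order.
fixed = pd.merge(canonical_edges(test_01), canonical_edges(test_02),
                 how="inner", on=["i1", "i2"])

print(len(naive))  # 1
print(len(fixed))  # 2
```

Sorting each pair once in both tables avoids the separate reversed join and full join, at the cost of discarding which endpoint was originally listed first.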
