How to control outlier nodes for network layout algorithms?

Question

I am presenting a large graph (>10000 nodes; >10000 edges) using the igraph package with the Fruchterman-Reingold layout algorithm. A few outlier nodes make the visualization difficult: 99% of the nodes huddle together, while the remaining 1% are located far away. For example, 99.9% of the nodes lie between 0 and 10, but 0.1% of the nodes lie beyond 10000. How can I control these outlier nodes so that all nodes can be presented?

Here is an example in which the 0.2% of nodes that are outliers make a full presentation difficult.

> library(igraph)
> set.seed(12)
> ig <- erdos.renyi.game(12000,1/10000,directed=TRUE,loops=FALSE)
> ig.layout <- layout_with_fr(ig)
> apply(ig.layout,2,quantile,c(0,0.001,0.01,0.1,0.9,0.99,0.999,1))
               [,1]         [,2]
0%      -54.7584289   -58.192821
0.1%    -49.8806632   -51.090376
1%      -29.7822097   -33.073435
10%      -0.2196407    -1.170996
90%      10.1564691    10.513665
99%    2026.5245335   737.739440
99.9% 16433.7302032 13168.400710
100%  22614.7986797 22284.309659

G5W · Accepted Answer

One way to "control" the outliers is to get rid of them. This will reduce your initial problem, but you will still be stuck with a big graph that is hard to visualize. But let's deal with one thing at a time. First, the outliers.

Unfortunately, you set the seed after you generated the graph. I will move the set.seed statement first so that the results will be reproducible.

library(igraph)
set.seed(12)
ig <- erdos.renyi.game(12000,1/10000,directed=TRUE,loops=FALSE)
ig.layout <- layout_with_fr(ig)
apply(ig.layout,2,quantile,c(0,0.001,0.01,0.1,0.9,0.99,0.999,1))
               [,1]          [,2]
0%    -5.359639e+01 -9.898871e+01
0.1%  -4.996891e+01 -5.046219e+01
1%    -3.040131e+01 -2.934615e+01
10%   -1.221806e-02  1.513951e-02
90%    1.207328e+01  1.130579e+01
99%    1.111746e+03  6.994646e+02
99.9%  1.418739e+04  1.182382e+04
100%   1.968552e+04  2.025938e+04

I get a result comparable to yours. More to the point, the graph is badly warped by the outliers.

plot(ig, layout=ig.layout, vertex.size=4, vertex.label=NA,
    edge.arrow.size=0.4)

But what are these outliers?

igComp = components(ig)
table(igComp$csize)
    1     2     3     4     5     6     7 10489 
 1041   137    42     8     5     1     1     1

Your graph has one very large component (10489 nodes) and 1235 small components of one to seven nodes each (1511 nodes in total). The "outliers" are the nodes in the small, disconnected components. My suggestion is that if you want to see the graph, eliminate these small components and look only at the big component.
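One way to check that the far-away nodes really are the ones outside the big component is to compare, per node, the distance from the centre of the layout with component membership. The following is only a sketch under assumptions: it uses a smaller random graph than the question's (2000 nodes, chosen so it runs quickly), and the names `g`, `in_big`, and `dist_from_centre` are illustrative, not part of the original code.

```r
library(igraph)

set.seed(12)
# Smaller graph than in the question; the sizes are illustrative assumptions.
g <- sample_gnp(2000, 1/1500, directed = TRUE, loops = FALSE)

comp   <- components(g)               # weakly connected components by default
big_id <- which.max(comp$csize)       # id of the largest component
in_big <- comp$membership == big_id   # TRUE for nodes in the big component

lay <- layout_with_fr(g)

# Distance of every node from the median position of the layout
centre <- apply(lay, 2, median)
dist_from_centre <- sqrt(rowSums(sweep(lay, 2, centre)^2))

# Spread of distances for nodes outside (FALSE) and inside (TRUE) the big component
tapply(dist_from_centre, in_big, quantile, probs = c(0.5, 0.99, 1))
```

If the nodes outside the big component show much larger distances, that supports treating the small components as the source of the outliers.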

# Pick the largest component by size rather than assuming its id is 1
C1 = induced_subgraph(ig, which(igComp$membership == which.max(igComp$csize)))

set.seed(12)
C1.layout <- layout_with_fr(C1)
apply(C1.layout,2,quantile,c(0,0.001,0.01,0.1,0.9,0.99,0.999,1))
            [,1]        [,2]
0%    -18.111038 -30.5068075
0.1%  -11.257167 -14.4507491
1%     -4.570292  -3.2830470
10%     0.124789   0.1836629
90%     7.182714   7.1506193
99%    12.291679  13.1523646
99.9%  26.812703  23.6325447
100%   35.186445  26.8564644

Now the layout is more reasonable.

plot(C1, layout=C1.layout, vertex.size=4, vertex.label=NA,
    edge.arrow.size=0.4)

Now the "outliers" are gone and we see the core of the graph. You have a different problem now: it is hard to look at roughly 10500 nodes and make sense of them, but at least you can see this core. I wish you luck with taking the exploration further.
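If keeping only the largest component throws away too much, a possible middle ground is to drop only the components below some size threshold. This is a sketch rather than a tested recipe for your data: the cutoff `min_size` is a hypothetical parameter, the graph is a smaller stand-in for the one in the question, and `g_kept` and `kept_layout` are illustrative names.

```r
library(igraph)

set.seed(12)
# Smaller stand-in graph; the sizes are illustrative assumptions.
g <- sample_gnp(2000, 1/1500, directed = TRUE, loops = FALSE)
comp <- components(g)

min_size <- 5  # hypothetical cutoff; tune it for your own graph

# comp$csize[comp$membership] is, for each node, the size of its component
keep   <- comp$csize[comp$membership] >= min_size
g_kept <- induced_subgraph(g, which(keep))

set.seed(12)
kept_layout <- layout_with_fr(g_kept)

# Re-check the coordinate quantiles before plotting
apply(kept_layout, 2, quantile, c(0, 0.01, 0.5, 0.99, 1))
```

Because the kept components are still disconnected from each other, the layout may again push some of them away from the core, so it is worth inspecting the quantiles again before plotting.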
