fratajcz
fratajcz

Reputation: 125

Read Graph from multiple files in IGraph (Python)

I have multiple node- and edgelists which form a large graph, lets call that the maingraph. My current strategy is to first read all the nodelists and import it with add_vertices. Every node then gets an internal id which depends on the order they are ingested and therefore isnt very reliable (as i've read it, if you delete one, all higher ids than the one deleted change). I assign every node a name attribute which corresponds to the external ID I use so I can keep track of my nodes between frameworks and a type attribute.

Now, how do I add the edges? When I read an edgelist it will start making a new graph (subgraph) and hence starts the internal ID at 0. Therefore, "merging" the graphs with maingraph.add_edges(subgraph.get_edgelist) inevitably fails.

It is possible to work around this and use the name attribute from both maingraph and subgraph to find out which internal ID each edges' incident nodes have in the maingraph:

def _get_real_source_and_target_id(edge):
    ''' takes an edge from the to-be-added subgraph and gets the ids of the corresponding nodes in the
    maingraph by their name '''
    source_id = maingraph.vs.select(name_eq=subgraph.vs[edge[0]]["name"])[0].index
    target_id = maingraph.vs.select(name_eq=subgraph.vs[edge[1]]["name"])[0].index
    return (source_id,target_id)

And then I tried

edgelist = [_get_real_source_and_target_id(x) for x in subgraph.get_edgelist()]
maingraph.add_edges(edgelist)

But that is hoooooorribly slow. The graph has millions of nodes and edges, which takes 10 seconds to load with the fast, but incorrect maingraph.add_edges(subgraph.get_edgelist) approach. with the correct approach explained above, it takes minutes (I usually stop it after 5 minutes o so). I will have to do this tens of thousands of times. I switched from NetworkX to Igraph because of the fast loading, but it doesn't really help if I have to do it like this.

Does anybody have a more clever way to do this? Any help much appreciated!

Thanks!

Upvotes: 0

Views: 300

Answers (1)

fratajcz
fratajcz

Reputation: 125

Nevermind, I figured out that the mistake was elsewhere. I used numpy.loadtxt() to read the node's names as strings, which somehow did funny stuff when the names were incrementing numbers with more than five figures (see my issue report here). Therefore the above solution got stuck when it tried to get the nodes where numpy messed up the node name. maingraph.vs.select(name_eq=subgraph.vs[edge[0]]["name"])[0].index simply sat there when it couldnt find the node. Now I use pandas to read the node names and it works fine.

The solution above is still ~10x faster than my previous NetworkX solution, so I will just leave it helps someone. Feel free to delete it otherwise.

Upvotes: 1

Related Questions