Reputation: 21574
I've got a very large network to read and analyse in NetworkX (around 500 million lines), stored as a gzipped weighted edge list (Node1 Node2 Weight). So far I have tried to read it with:
import gzip
import networkx as nx

# Open and read the gzipped file
with gzip.open(network, 'rb') as fh:
    # Read the weighted edge list
    G = nx.read_weighted_edgelist(fh, create_using=nx.DiGraph())
but since the file is very large I run into memory issues. Is there a way to read the file in "pandas" style, in chunks of fixed length? Thanks for your help.
EDIT:
This is a small extraction of my edgelist file (Node1 Node2 Weight):
30879005 5242 11
44608582 2295986 4
24935102 737450 1
42230925 1801294 1
20926179 2332390 1
40959246 1100438 1
3291058 3226104 1
23192021 5818064 1
16328715 7695005 1
11561383 2102983 1
1886716 1378893 2
23192021 5818065 1
2060097 2060091 1
7176482 3222203 2
46586813 1599030 1
35151866 35151866 1
12420680 1364416 5
612044 92878 1
16260783 3373725 1
26475759 85310 1
21149725 17011789 1
1312990 105320 1
23898296 1633222 3
3635610 2103011 1
12737940 4114680 1
18210502 10816500 1
45999903 45999903 1
8689446 1977413 1
5998987 3453478 3
Upvotes: 2
Views: 3937
Reputation: 394269
Read the data in as a CSV into a pandas DataFrame:
import pandas as pd

df = pd.read_csv(path_to_edge_list, sep=r'\s+', header=None,
                 names=['Node1', 'Node2', 'Weight'])
Now create a NetworkX DiGraph and use a list comprehension to generate a list of (node1, node2, weight) tuples as the edge data:
In [150]:
import networkx as nx
G = nx.DiGraph()
G.add_weighted_edges_from([tuple(x) for x in df.values])
G.edges()
Out[150]:
[(16328715, 7695005),
(42230925, 1801294),
(40959246, 1100438),
(12737940, 4114680),
(3635610, 2103011),
(16260783, 3373725),
(45999903, 45999903),
(7176482, 3222203),
(8689446, 1977413),
(11561383, 2102983),
(21149725, 17011789),
(18210502, 10816500),
(3291058, 3226104),
(23898296, 1633222),
(46586813, 1599030),
(2060097, 2060091),
(5998987, 3453478),
(44608582, 2295986),
(12420680, 1364416),
(612044, 92878),
(30879005, 5242),
(23192021, 5818064),
(23192021, 5818065),
(1312990, 105320),
(20926179, 2332390),
(26475759, 85310),
(24935102, 737450),
(35151866, 35151866),
(1886716, 1378893)]
Proof we have weight attributes:
In [153]:
G.get_edge_data(30879005,5242)
Out[153]:
{'weight': 11}
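Equivalently, the weight can be read by subscripting the graph directly; a minimal self-contained check using the first edge from the sample data:

```python
import networkx as nx

G = nx.DiGraph()
G.add_weighted_edges_from([(30879005, 5242, 11)])

# get_edge_data and subscripting return the same attribute dict
print(G.get_edge_data(30879005, 5242))   # → {'weight': 11}
print(G[30879005][5242]['weight'])       # → 11
```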
To read the edge list in chunks, set the chunksize param in read_csv and add the edges and weights using the above code for each chunk.
EDIT
So to read in chunks you can do this:
import networkx as nx
import pandas as pd

G = nx.DiGraph()
for d in pd.read_csv(path_to_edge_list, sep=r'\s+', header=None,
                     names=['Node1', 'Node2', 'Weight'], chunksize=10000):
    G.add_weighted_edges_from([tuple(x) for x in d.values])
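Since the original file is gzipped, read_csv can decompress it on the fly (compression='gzip', or 'infer' when the path ends in .gz), so no separate gzip.open step is needed. A self-contained sketch, using an in-memory gzipped buffer to stand in for the real file:

```python
import gzip
import io

import networkx as nx
import pandas as pd

# Tiny gzipped edge list standing in for the real 500M-line file
raw = b"30879005 5242 11\n44608582 2295986 4\n24935102 737450 1\n"
buf = io.BytesIO(gzip.compress(raw))

G = nx.DiGraph()
# compression='gzip' decompresses while reading; chunksize bounds memory use
for chunk in pd.read_csv(buf, sep=r'\s+', header=None,
                         names=['Node1', 'Node2', 'Weight'],
                         compression='gzip', chunksize=2):
    # itertuples(name=None) yields plain (node1, node2, weight) tuples
    G.add_weighted_edges_from(chunk.itertuples(index=False, name=None))

print(G.number_of_edges())           # → 3
print(G[30879005][5242]['weight'])   # → 11
```

With a file on disk you would pass its path instead of the buffer; note that only the pandas read is chunked here, the graph itself still has to fit in memory.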
Upvotes: 5