Simd

Reputation: 21343

Converting a 1.2GB list of edges into a sparse matrix

I have a 1.2GB list of edges from a graph in a text file. My Ubuntu PC has 8GB of RAM. Each line in the input looks like

287111206 357850135

I would like to convert it into a sparse adjacency matrix and output that to a file.

Some statistics for my data:

Number of edges: around 62500000
Number of vertices: around 31250000

I asked much the same question before at https://stackoverflow.com/a/38667644/2179021 and got a great answer. The problem is that I can't get it to work.

I first tried np.loadtxt to load in the file, but it was very slow and used a huge amount of memory. So instead I moved to pandas.read_csv, which is very fast but caused its own problems. This is my current code:

import pandas
import numpy as np
from scipy import sparse

data = pandas.read_csv("edges.txt", sep=" ", header= None, dtype=np.uint32)
A = data.as_matrix()
print type(A)
k1,k2,k3=np.unique(A,return_inverse=True,return_index=True)
rows,cols=k3.reshape(A.shape).T
M=sparse.coo_matrix((np.ones(rows.shape,int),(rows,cols)))
print type(M)

The problem is that the pandas dataframe data is huge and I am effectively making a copy of it in A, which is inefficient. However, things are even worse, as the code crashes with:

<type 'numpy.ndarray'>
Traceback (most recent call last):
  File "make-sparse-matrix.py", line 12, in <module>
    k1,k2,k3=np.unique(A,return_inverse=True,return_index=True)
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/arraysetops.py", line 209, in unique
    iflag = np.cumsum(flag) - 1
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 2115, in cumsum
    return cumsum(axis, dtype, out)
MemoryError

So my questions are:

  1. Can I avoid having both the 1.2GB pandas dataframe and the 1.2GB numpy array copy in memory?
  2. Is there some way to get the code to complete in 8GB of RAM?

You can reproduce a test input of the size I am trying to process with:

import random
#Number of edges, vertices
m = 62500000
n = m/2
for i in xrange(m):
    fromnode = str(random.randint(0, n-1)).zfill(9)
    tonode = str(random.randint(0, n-1)).zfill(9)
    print fromnode, tonode

Update

I have now tried a number of different approaches, all of which have failed. Here is a summary.

  1. Using igraph with g = Graph.Read_Ncol('edges.txt'). This uses a huge amount of RAM which crashes my computer.
  2. Using networkit with G= networkit.graphio.readGraph("edges.txt", networkit.Format.EdgeList, separator=" ", continuous=False). This uses a huge amount of RAM which crashes my computer.
  3. The code above in this question but using np.loadtxt("edges.txt") instead of pandas. This uses a huge amount of RAM which crashes my computer.

I then wrote separate code which remapped all the vertex names to numbers from 1..|V|, where |V| is the total number of vertices (a minimal sketch of such a remapping follows the list below). This should save the code that imports the edge list from having to build up a table that maps the vertex names. Using this I tried:

  1. Using this new remapped edge list file I used igraph again with g = Graph.Read_Edgelist("edges-contig.txt"). This now works, although it takes 4GB of RAM (which is far more than it theoretically should need). However, there is no igraph function to write out a sparse adjacency matrix from a graph. The recommended solution is to convert the graph to a coo_matrix. Unfortunately this uses a huge amount of RAM, which crashes my computer.
  2. Using the remapped edge list file I used networkit with G = networkit.readGraph("edges-contig.txt", networkit.Format.EdgeListSpaceOne). This also works, using less than the 4GB of RAM that igraph needs. networkit also comes with a function to write Matlab files (which is a form of sparse adjacency matrix that scipy can read). However, networkit.graphio.writeMat(G,"test.mat") uses a huge amount of RAM, which crashes my computer.
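
A minimal sketch of such a remapping, using np.unique's inverse indices to do the renumbering (0-based here; not necessarily the exact code, but the idea):

import numpy as np
import pandas as pd

# read the raw edge list, renumber the distinct vertex ids contiguously
# and write the remapped edge list back out
data = pd.read_csv("edges.txt", sep=" ", header=None, dtype=np.uint32).values
uniq, inverse = np.unique(data, return_inverse=True)
np.savetxt("edges-contig.txt", inverse.reshape(data.shape), fmt="%d")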

Finally sascha's answer below does complete but takes about 40 minutes.

Upvotes: 13

Views: 4916

Answers (5)

ead

Reputation: 34357

In my answer I consider the case where the node ids are given by 9-character strings, each character from [0-9A-Za-z]. These n node ids should be mapped onto the values [0,n-1] (which might not be necessary for your application, but is still of general interest).

The following considerations, which I'm sure you are aware of, are here for the sake of completeness:

  1. Memory is the bottleneck.
  2. There are around 10^8 strings in the file.
  3. A 9-character string + int32 value pair costs around 120 bytes in a dictionary, resulting in 12GB of memory usage for the file.
  4. A string id from the file can be mapped onto an int64: there are 62 different characters, so each can be encoded with 6 bits, and 9 characters per string give 6*9=54<64 bits. See also the toInt64() method further below.
  5. There are int64+int32=12 bytes of "real" information per pair, so ca. 1.2GB could be enough; however, the cost of such a pair in a dictionary is around 60 bytes (around 6GB of RAM needed).
  6. Creating many small objects (on the heap) results in a lot of memory overhead, so bundling these objects in arrays is advantageous. Interesting information about the memory used by Python objects can be found in this tutorial-style article. Interesting experiences with reducing memory usage are made public in this blog entry.
  7. A Python list is out of the question as a data structure, as is a dictionary. array.array could be an alternative, but we use np.array (because there are sorting algorithms for np.array but not for array.array).

1. step: reading the file and mapping the strings to int64. It is a pain to let an np.array grow dynamically, so we assume we know the number of edges in the file (it would be nice to have it in the header, but it can also be deduced from the file size):

import numpy as np

def read_nodes(filename, EDGE_CNT):
    nodes = np.zeros(EDGE_CNT*2, dtype=np.int64)
    cnt = 0
    for line in open(filename, "r"):
        nodes[cnt:cnt+2] = map(toInt64, line.split())  # use map(int, line.split()) for cases without letters
        cnt += 2
    return nodes
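
The deduction from the file size mentioned above could, for example, look like this (a sketch; the 20 bytes per line assume exactly two 9-character ids, one space and one newline per line):

import os

def deduce_edge_cnt(filename, bytes_per_line=20):
    # 9 + 1 + 9 + 1 bytes for a line of the form "xxxxxxxxx yyyyyyyyy\n"
    return os.path.getsize(filename) // bytes_per_line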

2. step: converting the int64-values into values [0,n-1]:

Possibility A, needs 3*0.8GB:

def maps_to_ids(filename, EDGE_CNT):
    """ return number of different node ids, and the mapped nodes"""
    nodes = read_nodes(filename, EDGE_CNT)
    unique_ids, nodes = np.unique(nodes, return_inverse=True)
    return (len(unique_ids), nodes)

Possibility B, needs 2*0.8GB, but is somewhat slower:

def maps_to_ids(filename, EDGE_CNT):
    """ return number of different node ids, and the mapped nodes"""
    nodes=read_nodes(filename, EDGE_CNT)
    unique_map = np.unique(nodes)
    for i in xrange(len(nodes)):
        node_id=np.searchsorted(unique_map, nodes[i]) # faster than bisect.bisect
        nodes[i]=node_id  
    return (len(unique_map), nodes)  
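
A possible middle ground between A and B (a sketch of mine, not part of the original two possibilities, building on read_nodes above): keep B's in-place overwrite, but let np.searchsorted process whole chunks instead of single elements, which is much faster at roughly the same memory cost:

def maps_to_ids_chunked(filename, EDGE_CNT, chunk=10**6):
    """ return number of different node ids, and the mapped nodes"""
    nodes = read_nodes(filename, EDGE_CNT)
    unique_map = np.unique(nodes)
    for start in xrange(0, len(nodes), chunk):
        # overwrite in place, one chunk at a time, so the temporary array
        # produced by searchsorted is only `chunk` elements long
        nodes[start:start+chunk] = np.searchsorted(unique_map, nodes[start:start+chunk])
    return (len(unique_map), nodes)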

3. step: put it all into coo_matrix:

from scipy import sparse
def data_as_coo_matrix(filename, EDGE_CNT):
    node_cnt, nodes = maps_to_ids(filename, EDGE_CNT)
    rows = nodes[::2]   # only a view, not a copy
    cols = nodes[1::2]  # only a view, not a copy

    return sparse.coo_matrix((np.ones(len(rows), dtype=bool), (rows, cols)), shape=(node_cnt, node_cnt))

Calling data_as_coo_matrix("data.txt", 62500000), the memory need peaks at 2.5GB (with int32 instead of int64, only 1.5GB is needed). It took around 5 minutes on my machine, but my machine is pretty slow...

So what is different from your solution?

  1. I only get the unique values from np.unique (not all the indices and the inverse), so some memory is saved - I can replace the old ids with the new ones in place.
  2. I have no experience with pandas so maybe there is some copying involved between pandas<->numpy data structures?

What is the difference from sascha's solution?

  1. There is no need to keep the list sorted all the time - it is enough to sort after all items are in the list, which is what np.unique() does. sascha's solution keeps the list sorted the whole time - you have to pay for this with at least a constant factor, even if the running time stays O(n log(n)). I assumed an add operation would be O(n), but as pointed out it is O(log(n)).

What is the difference to GrantJ's solution?

  1. The size of the resulting sparse matrix is NxN, where N is the number of different nodes, and not 2^54x2^54 (with very many empty rows and columns).

PS:
Here is my idea of how the 9-character string id can be mapped onto an int64 value. I guess this function could become a bottleneck the way it is written and should be optimized; a vectorized sketch follows below.

def toInt64(string):
    res = 0L
    for ch in string:
        res *= 62
        if ch <= '9':
            res += ord(ch) - ord('0')
        elif ch <= 'Z':
            res += ord(ch) - ord('A') + 10
        else:
            res += ord(ch) - ord('a') + 36
    return res
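
A rough idea for speeding this up (a vectorized sketch under the same [0-9A-Za-z] assumption; the names here are mine): build a 256-entry lookup table once and encode many 9-character ids per call.

import numpy as np

# lookup table from ASCII code to the digit value in base 62
LUT = np.zeros(256, dtype=np.int64)
for i, c in enumerate('0123456789'):
    LUT[ord(c)] = i
for i, c in enumerate('ABCDEFGHIJKLMNOPQRSTUVWXYZ'):
    LUT[ord(c)] = 10 + i
for i, c in enumerate('abcdefghijklmnopqrstuvwxyz'):
    LUT[ord(c)] = 36 + i

def toInt64_batch(ids):
    # ids: a sequence of 9-character strings; returns an int64 array
    raw = np.frombuffer(''.join(ids).encode('ascii'), dtype=np.uint8)
    digits = LUT[raw].reshape(-1, 9)
    weights = 62 ** np.arange(8, -1, -1, dtype=np.int64)
    return digits.dot(weights)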

Upvotes: 4

GrantJ

Reputation: 8719

Here's my solution:

import numpy as np
import pandas as pd
import scipy.sparse as ss

def read_data_file_as_coo_matrix(filename='edges.txt'):
    "Read data file and return sparse matrix in coordinate format."
    data = pd.read_csv(filename, sep=' ', header=None, dtype=np.uint32)
    rows = data[0]  # Not a copy, just a reference.
    cols = data[1]
    ones = np.ones(len(rows), np.uint32)
    matrix = ss.coo_matrix((ones, (rows, cols)))
    return matrix

Pandas does the heavy lifting of parsing with read_csv, and it already stores the data in columnar format, so data[0] and data[1] are just references, not copies. I then feed those to coo_matrix. Benchmarked locally:

In [1]: %timeit -n1 -r5 read_data_file_as_coo_matrix()
1 loop, best of 5: 14.2 s per loop

Then to save a csr-matrix to a file:

def save_csr_matrix(filename, matrix):
    """Save compressed sparse row (csr) matrix to file.

    Based on http://stackoverflow.com/a/8980156/232571

    """
    assert filename.endswith('.npz')
    attributes = {
        'data': matrix.data,
        'indices': matrix.indices,
        'indptr': matrix.indptr,
        'shape': matrix.shape,
    }
    np.savez(filename, **attributes)

Benchmarked locally:

In [3]: %timeit -n1 -r5 save_csr_matrix('edges.npz', matrix.tocsr())
1 loop, best of 5: 13.4 s per loop

And later load it back from a file:

def load_csr_matrix(filename):
    """Load compressed sparse row (csr) matrix from file.

    Based on http://stackoverflow.com/a/8980156/232571

    """
    assert filename.endswith('.npz')
    loader = np.load(filename)
    args = (loader['data'], loader['indices'], loader['indptr'])
    matrix = ss.csr_matrix(args, shape=loader['shape'])
    return matrix

Benchmarked locally:

In [4]: %timeit -n1 -r5 load_csr_matrix('edges.npz')
1 loop, best of 5: 881 ms per loop

And finally test it all:

def test():
    "Test data file parsing and matrix serialization."
    coo_matrix = read_data_file_as_coo_matrix()
    csr_matrix = coo_matrix.tocsr()
    save_csr_matrix('edges.npz', csr_matrix)
    loaded_csr_matrix = load_csr_matrix('edges.npz')
    # Comparison based on http://stackoverflow.com/a/30685839/232571
    assert (csr_matrix != loaded_csr_matrix).nnz == 0

if __name__ == '__main__':
    test()

When running test(), it takes about 30 seconds:

$ time python so_38688062.py 
real    0m30.401s
user    0m27.257s
sys     0m2.759s

And the memory high-water mark was ~1.79 GB.

Note that once you've converted the "edges.txt" to "edges.npz" in CSR-matrix format, loading it will take less than a second.
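
In other words, the intended workflow is roughly this (a usage sketch of the functions above):

# one-time conversion from the text edge list to the binary CSR format
matrix = read_data_file_as_coo_matrix('edges.txt').tocsr()
save_csr_matrix('edges.npz', matrix)

# every later run only needs the fast load
matrix = load_csr_matrix('edges.npz')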

Upvotes: 14

Walter

Reputation: 31

I tried the different methods available apart from the ones already used, and found that the following work well.

Method 1 - Read the file into a string and parse the string into a 1-D array using numpy's fromstring.

import numpy as np
import scipy.sparse as sparse

def readEdges():
    with open('edges.txt') as f:
        data = f.read()  
    edges = np.fromstring(data, dtype=np.int32, sep=' ')
    edges = np.reshape(edges, (edges.shape[0]/2, 2))
    ones = np.ones(len(edges), np.uint32)
    cooMatrix = sparse.coo_matrix((ones, (edges[:,0], edges[:,1])))
%timeit -n5 readEdges()

Output:

5 loops, best of 3: 13.6 s per loop

Method 2 - Same as method 1 except instead of loading the file into a string, using the memory mapped interface.

import contextlib
import mmap

def readEdgesMmap():
    with open('edges.txt') as f:
        with contextlib.closing(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)) as m:
            edges = np.fromstring(m, dtype=np.int32, sep=' ')
            edges = np.reshape(edges, (edges.shape[0]/2, 2))
            ones = np.ones(len(edges), np.uint32)
            cooMatrix = sparse.coo_matrix((ones, (edges[:,0], edges[:,1])))

%timeit -n5 readEdgesMmap()

Output:

5 loops, best of 3: 12.7 s per loop

Monitored using /usr/bin/time, both methods use a maximum of around 2GB of memory.

A few notes:

  1. It seems to do slightly better than pandas read_csv. Using pandas read_csv, the output on the same machine is

    5 loops, best of 3: 16.2 s per loop

  2. Conversion from COO to CSR/CSC consumes significant time too (a small timing sketch follows these notes). In @GrantJ's answer it takes less time because the COO matrix initialization is incorrect: the argument needs to be given as a tuple. I wanted to leave a comment there but I don't have commenting rights yet.

  3. My guess on why this is slightly better than pandas read_csv is the prior assumption of 1D data.
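
The COO to CSR cost mentioned in note 2 can be checked with something like this (a sketch reusing method 1's parsing):

import time
import numpy as np
import scipy.sparse as sparse

with open('edges.txt') as f:
    edges = np.fromstring(f.read(), dtype=np.int32, sep=' ').reshape(-1, 2)
cooMatrix = sparse.coo_matrix((np.ones(len(edges), np.uint32), (edges[:, 0], edges[:, 1])))

start = time.time()
csrMatrix = cooMatrix.tocsr()
print('COO -> CSR took %.1f seconds' % (time.time() - start))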

Upvotes: 3

sascha

Reputation: 33532

Updated version

As indicated in the comments, the approach did not fit your use-case. Let's make some changes:

  • use pandas for reading in the data (instead of numpy: I'm quite surprised np.loadtxt performs that badly!)
  • use the external library sortedcontainers for a more memory-efficient approach (instead of a dictionary)
  • the basic approach is the same

This approach will take ~45 minutes (which is slow; but you could pickle/save the result so you only need to do it once - a save/load sketch follows the code below) and ~5 GB of memory to prepare the sparse matrix for your data, generated with:

import random
N = 62500000
for i in xrange(N):
    print random.randint(10**8,10**9-1), random.randint(10**8,10**9-1)

Code

import numpy as np
from scipy.sparse import coo_matrix
import pandas as pd
from sortedcontainers import SortedList
import time

# Read data
# global memory usage after: one big array
df = pd.read_csv('EDGES.txt', delimiter=' ', header=None, dtype=np.uint32)
data = df.as_matrix()
df = None
n_edges = data.shape[0]

# Learn mapping to range(0, N_VERTICES)  # N_VERTICES unknown
# global memory usage after: one big array + one big searchtree
print('fit mapping')
start = time.time()
observed_vertices = SortedList()
mappings = np.arange(n_edges*2, dtype=np.uint32)  # upper bound on vertices
for column in range(data.shape[1]):
    for row in range(data.shape[0]):
        # double-loop: slow, but easy to understand space-complexity
        val = data[row, column]
        if val not in observed_vertices:
            observed_vertices.add(val)
mappings = mappings[:len(observed_vertices)]
n_vertices = len(observed_vertices)
end = time.time()
print(' secs: ', end-start)

print('transform mapping')
# Map original data (in-place !)
# global memory usage after: one big array + one big searchtree(can be deleted!)
start = time.time()
for column in range(data.shape[1]):
    for row in range(data.shape[0]):
        # double-loop: slow, but easy to understand space-complexity
        val = data[row, column]
        mapper_pos = observed_vertices.index(val)
        data[row, column] = mappings[mapper_pos]
end = time.time()
print(' secs: ', end-start)
observed_vertices = None  # if not needed anymore
mappings = None  # if not needed anymore

# Create sparse matrix (only caring about a single triangular part for now)
# if needed: delete dictionary before as it's not needed anymore!
sp_mat = coo_matrix((np.ones(n_edges, dtype=bool), (data[:, 0], data[:, 1])), shape=(n_vertices, n_vertices))
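
As mentioned above, since the preparation takes ~45 minutes it is worth persisting the result. One way (a sketch, reusing the imports and sp_mat from the code above): store the COO components with numpy and rebuild the matrix on later runs.

np.savez('sp_mat.npz', row=sp_mat.row, col=sp_mat.col,
         data=sp_mat.data, shape=sp_mat.shape)

loader = np.load('sp_mat.npz')
sp_mat = coo_matrix((loader['data'], (loader['row'], loader['col'])),
                    shape=tuple(loader['shape']))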

First version

Here is a very simple and very inefficient (in regards to time and space) code to build this sparse matrix. I post this code because I believe it is important to understand the core parts if one is using these in something bigger.

Let's see if this code is efficient enough for your use-case or if it needs work. From a distance it's hard to tell, because we don't have your data.

The dictionary part, used for the mapping, is a candidate to blow up your memory. But it's pointless to optimize this without knowing whether that's needed at all, especially because this part of the code depends on the number of vertices in your graph (and I don't have any knowledge of that cardinality).

""" itertools.count usage here would need changes for py2 """

import numpy as np
from itertools import count
from scipy.sparse import coo_matrix


# Read data
# global memory usage after: one big array
data = np.loadtxt('edges.txt', np.uint32)
n_edges = data.shape[0]
#print(data)
#print(data.shape)

# Learn mapping to range(0, N_VERTICES)  # N_VERTICES unknown
# global memory usage after: one big array + one big dict 
index_gen = count()
mapper = {}
for column in range(data.shape[1]):
    for row in range(data.shape[0]):
        # double-loop: slow, but easy to understand space-complexity
        val = data[row, column]
        if val not in mapper:
            mapper[val] = next(index_gen)
n_vertices = len(mapper)

# Map original data (in-place !)
# global memory usage after: one big array + one big dict (can be deleted!)
for column in range(data.shape[1]):
    for row in range(data.shape[0]):
        # double-loop: slow, but easy to understand space-complexity
        data[row, column] = mapper[data[row, column]]
#print(data)

# Create sparse matrix (only caring about a single triangular part for now)
# if needed: delete dictionary before as it's not needed anymore!
sp_mat = coo_matrix((np.ones(n_edges, dtype=bool), (data[:, 0], data[:, 1])), shape=(n_vertices, n_vertices))
#print(sp_mat)

Output for edges-10.txt:

[[287111206 357850135]
 [512616930 441657273]
 [530905858 562056765]
 [524113870 320749289]
 [149911066 964526673]
 [169873523 631128793]
 [646151040 986572427]
 [105290138 382302570]
 [194873438 968653053]
 [912211115 195436728]]
(10, 2)
[[ 0 10]
 [ 1 11]
 [ 2 12]
 [ 3 13]
 [ 4 14]
 [ 5 15]
 [ 6 16]
 [ 7 17]
 [ 8 18]
 [ 9 19]]
  (0, 10)   True
  (1, 11)   True
  (2, 12)   True
  (3, 13)   True
  (4, 14)   True
  (5, 15)   True
  (6, 16)   True
  (7, 17)   True
  (8, 18)   True
  (9, 19)   True

Upvotes: 3

maxymoo

Reputation: 36555

You might want to take a look at the igraph project; it is a GPL-licensed library of C code which is designed for this kind of thing and has a nice Python API. I think in your case the Python code would be something like:

from igraph import Graph
g = Graph.Read_Edgelist('edges.txt')
g.write_adjacency('adjacency_matrix.txt')

Upvotes: 0
