tdc

Loading a Matlab sparse matrix using Python PyTables

I originally asked a related question here, but didn't really seem to get anywhere. Perhaps rephrasing part of it more specifically will help...

I have files stored using Matlab's sparse format (HDF5-based, CSC layout I believe), and I'm trying to use PyTables to operate on them directly, but haven't succeeded yet. Using h5py I can do the following:

# Method 1: uses h5py (WORKS)
import h5py
import scipy.sparse

f1 = h5py.File(fname)
data = f1['M']['data']
ir = f1['M']['ir']
jc = f1['M']['jc']
M = scipy.sparse.csc_matrix((data, ir, jc))

but if I try to do the equivalent in Pytables:

# Method 2: uses PyTables (DOESN'T WORK)
import tables

f2 = tables.openFile(fname)
data = f2.root.M.data
ir = f2.root.M.ir
jc = f2.root.M.jc
M = scipy.sparse.csc_matrix((data, ir, jc))

this fails (after a long wait) with the error:

TypeError                                 Traceback (most recent call last)

/home/tdiethe/BMJ/<ipython console> in <module>()

/usr/lib/python2.6/dist-packages/scipy/sparse/compressed.pyc in __init__(self, arg1, shape, dtype, copy, dims, nzmax)
    56                     self.indices = np.array(indices, copy=copy)
    57                     self.indptr  = np.array(indptr, copy=copy)
---> 58                     self.data    = np.array(data, copy=copy, dtype=getdtype(dtype, data))
    59                 else:
    60                     raise ValueError, "unrecognized %s_matrix constructor usage" %\

/usr/lib/python2.6/dist-packages/scipy/sparse/sputils.pyc in getdtype(dtype, a, default)
    69                 canCast = False
    70             else:
---> 71                 raise TypeError, "could not interpret data type"
    72     else:
    73         newdtype = np.dtype(dtype)

TypeError: could not interpret data type

Looking at f2:

In [63]: f2.root.M.data
Out[63]: 
/M/data (CArray(4753606,), zlib(3)) ''
  atom := Float64Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := (8181,)

In [64]: f2.root.M.ir
Out[64]: 
/M/ir (CArray(4753606,), zlib(3)) ''
  atom := UInt64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := (8181,)

In [65]: f2.root.M.jc
Out[65]: 
/M/jc (CArray(133339,), zlib(3)) ''
  atom := UInt64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := (7843,)

I have two questions:


Answers (1)

dtlussier

I missed seeing this in your original post, but I think your issue lies in the design of PyTables, which adds an extra level of abstraction on top of the underlying data.

Consider the following:

>>> import tables
>>> import numpy as np

>>> h5_file = tables.openFile(fname)
>>> data = h5_file.root.M.data

At this point data is not a numpy array:

>>> type(data)
tables.array.Array

>>> isinstance(data, np.ndarray)
False

The tables.array.Array does not immediately load the underlying array, nor does it immediately expose array-like functionality. This is what led to the error when you tried to use these objects to create a sparse matrix in scipy.

Instead, the data object produced by PyTables is intended to provide access to the underlying data through additional commands (e.g. the fancy indexing [...] you used). In this approach you can read part of the data or all of it by doing data[:] or data.read(); only at that point is the familiar numpy array produced.
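So your Method 2 should work once the arrays are actually read into memory first. A minimal sketch, assuming the same file layout as in your question:

# Method 2, fixed: materialize the PyTables arrays as numpy arrays first
import tables
import scipy.sparse

f2 = tables.openFile(fname)
data = f2.root.M.data.read()    # or f2.root.M.data[:]
ir = f2.root.M.ir.read()        # row indices
jc = f2.root.M.jc.read()        # column pointers
M = scipy.sparse.csc_matrix((data, ir, jc))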

For more information on the tables.array.Array class, see http://pytables.github.com/usersguide/libref.html#the-array-class, or the "Getting actual data" section at http://www.pytables.org/moin/HowToUse for examples of accessing underlying data.

In comparison, h5py produces much more array-like objects, although still not numpy arrays. Consider:

>>> import h5py
>>> f1 = h5py.File(fname)
>>> data = f1['M']['data']
>>> type(data)
h5py._hl.dataset.Dataset
>>> isinstance(data, np.ndarray)
False

However, you can immediately use data in numpy operations, like your call to scipy, or simpler expressions like np.cos(data) or data + np.arange(len(data)). The data object also has some familiar numpy-like attributes (e.g. shape), and the underlying data (a numpy.ndarray) is available at data.value. However, I haven't used h5py myself, so I'm not sure what its limitations are in this regard.
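For example (a quick sketch; the shape shown is taken from the output in your question):

>>> data.shape                  # numpy-like attribute on the h5py Dataset
(4753606,)
>>> arr = data[...]             # fancy indexing yields a true numpy.ndarray
>>> isinstance(arr, np.ndarray)
True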

In general, it appears that PyTables and h5py have different design goals and should therefore be used in different ways: h5py offers a more numpy-like interface to HDF5 files, while PyTables offers more complex, database-like operations. See the discussion of the differences in the h5py and PyTables docs and on the Enthought mailing list.
