Reputation: 45

Access data from HDF5 - slice/extract data

I am trying to target specific rows within a large matrix contained in an HDF5 file. I have a .txt file that contains the ids of interest that are present in the HDF5 file and wish to output the corresponding data of those rows - all corresponding data are numerals.

I wrote the following code but the output only contains the ids (single column). I need the remaining data attached to those rows (the data in the subsequent columns). Any advice would be much appreciated!

import os 
import h5py


mydir = os.path.expanduser("~/Desktop/alexs-stuff/")

in_file = mydir + "EMP/EMPopen/full_emp_table_hdf5.h5"
wanted_file = mydir + "EMP/greengenes-curto-only.txt"
out_file = mydir + "EMP/emp-curto-only.txt"

wanted = set()

with open(wanted_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)

hdf5_file = h5py.File(in_file, "r")

count = 0

with open(out_file, "w") as h:
    for keys in hdf5_file["observation"]["ids"]:
        if keys in wanted:
            count = count + 1
            h.write(keys + "\n")

print "Converted %i records" % count 

hdf5_file.close()

If it helps, here is the structure of the hdf5 file:

<HDF5 file "full_emp_table_hdf5.h5" (mode r)> (File) /
 sample     /sample (Group) /sample
     metadata     /sample/metadata (Group) /sample/metadata
     group-metadata     /sample/group-metadata (Group) /sample/group-metadata
     ids     /sample/ids (Dataset) /sample/ids     len = (15481,) object
     matrix     /sample/matrix (Group) /sample/matrix
         indices     /sample/matrix/indices (Dataset) /sample/matrix/indices     len = (107439386,) int32
         indptr     /sample/matrix/indptr (Dataset) /sample/matrix/indptr     len = (15482,) int32
         data     /sample/matrix/data (Dataset) /sample/matrix/data     len = (107439386,) float64
 observation     /observation (Group) /observation
     metadata     /observation/metadata (Group) /observation/metadata
         taxonomy     /observation/metadata/taxonomy (Dataset) /observation/metadata/taxonomy     len = (5594412, 7) object
     group-metadata     /observation/group-metadata (Group) /observation/group-metadata
     ids     /observation/ids (Dataset) /observation/ids     len = (5594412,) object
     matrix     /observation/matrix (Group) /observation/matrix
         indices     /observation/matrix/indices (Dataset) /observation/matrix/indices     len = (107439386,) int32
         indptr     /observation/matrix/indptr (Dataset) /observation/matrix/indptr     len = (5594413,) int32
         data     /observation/matrix/data (Dataset) /observation/matrix/data     len = (107439386,) float64

additional information:

type(hdf5_file['observation']['ids'])
>>> <class 'h5py._hl.dataset.Dataset'>


dir(hdf5_file['observation']['ids'])
>>> ['__array__', '__class__', '__delattr__', '__dict__', '__doc__', '__eq__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__iter__', '__len__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_d', '_dcpl', '_e', '_filters', '_id', '_lapl', '_lcpl', '_local', 'astype', 'attrs', 'chunks', 'compression', 'compression_opts', 'dims', 'dtype', 'file', 'fillvalue', 'fletcher32', 'id', 'len', 'maxshape', 'name', 'parent', 'read_direct', 'ref', 'regionref', 'resize', 'scaleoffset', 'shape', 'shuffle', 'size', 'value', 'write_direct']

Upvotes: 3

Answers (4)

Pro Q

Reputation: 5048

This error occurred for me because I was trying to index into a h5py.Dataset.

This usually happens for one of two reasons:

You thought your object was a h5py.Group, and really it was a h5py.Dataset. You can check this by merely printing the type of the thing you're trying to index before indexing it to make sure it's an h5py.Group and not an h5py.Dataset. (i.e. print(type(thing)) before doing thing["key"]. You want it to say h5py.Group, not h5py.Dataset.)
You thought your object was a numpy array, but it was a dataset. In this case, you need to use dataset[:] on your h5py.Dataset to get its numpy array or dataset[()] to get its scalar value.

Upvotes: 0

Alex Chase

Reputation: 45

I did not find an easy answer using python, but if anyone else comes across a similar problem, use:

$biom subset-table -i input_file.biom -a observation -s ids_wanted.txt -o biom

you can then convert this:

$biom convert -i outputfilefromabove.biom -o outputfilefromabove.txt --to_tsv

This method is way easier and takes advantage of the great resources at biom!

Upvotes: 0

hpaulj

Reputation: 231738

I think this hdf5 file was produced by biom-format software, http://biom-format.org, From its pypi package: https://pypi.python.org/packages/source/b/biom-format/biom-format-2.1.2.tar.gz

I can extract a sample hdf5 file: examples/min_sparse_otu_table_hdf5.biom, and process it with:

import numpy as np
from scipy import sparse as sp
import h5py

f = h5py.File('min_sparse_otu_table_hdf5.biom','r')
f.visit(print)  # to see its structure
"""
observation
observation/group-metadata
observation/ids
observation/matrix
observation/matrix/data
observation/matrix/indices
observation/matrix/indptr
observation/metadata
sample
sample/group-metadata
sample/ids
sample/matrix
sample/matrix/data
sample/matrix/indices
sample/matrix/indptr
sample/metadata
"""

Or in more detail:

f.visititems(lambda name,obj:print(name, obj))
"""
...
sample <HDF5 group "/sample" (4 members)>
sample/group-metadata <HDF5 group "/sample/group-metadata" (0 members)>
sample/ids <HDF5 dataset "ids": shape (6,), type "|O4">
sample/matrix <HDF5 group "/sample/matrix" (3 members)>
sample/matrix/data <HDF5 dataset "data": shape (15,), type "<f8">
sample/matrix/indices <HDF5 dataset "indices": shape (15,), type "<i4">
sample/matrix/indptr <HDF5 dataset "indptr": shape (7,), type "<i4">
sample/metadata <HDF5 group "/sample/metadata" (0 members)>
"""

This has a similar structure to your full_emp_table_hdf5.h5

Load arrays with .value (or [:])

sample_ids = f['sample/ids'].value
"""
array([b'Sample1', b'Sample2', b'Sample3', b'Sample4', b'Sample5',
       b'Sample6'], dtype=object)
"""

# create a sparse array from the sample/matrix group
data = f['sample/matrix/data'].value
indices = f['sample/matrix/indices'].value
indptr = f['sample/matrix/indptr'].value
sample = sp.csr_matrix((data,indices,indptr))
# display as a dense array
print(sample.A)

""" 
array([[ 0.,  5.,  0.,  2.,  0.],
       [ 0.,  1.,  0.,  1.,  1.],
       [ 1.,  0.,  1.,  1.,  1.],
       [ 0.,  2.,  4.,  0.,  0.],
       [ 0.,  3.,  0.,  0.,  0.],
       [ 0.,  1.,  2.,  1.,  0.]])
"""

# select a specific row
print(sample[2,:])
"""
  (0, 0)    1.0
  (0, 2)    1.0
  (0, 3)    1.0
  (0, 4)    1.0
"""
print(sample[2,:].A)  # that row in dense format
# array([[ 1.,  0.,  1.,  1.,  1.]])

I can write sample to a text file with:

np.savetxt('min_sample.txt',sample.A, '%.2f')

"""
0.00 5.00 0.00 2.00 0.00
0.00 1.00 0.00 1.00 1.00
1.00 0.00 1.00 1.00 1.00
0.00 2.00 4.00 0.00 0.00
0.00 3.00 0.00 0.00 0.00
0.00 1.00 2.00 1.00 0.00
"""

The biom module probably can do all this and more, but the basics just need the numpy and sparse modules.

Upvotes: 1

vikramls

Reputation: 1822

Change the line:

h.write(keys + "\n")

to this:

h.write(hdf5_file['observation']['ids'][keys] + "\n")

Upvotes: 0

Access data from HDF5 - slice/extract data

Answers (4)

Related Questions