Reputation: 45
I am trying to target specific rows within a large matrix contained in an HDF5 file. I have a .txt file that contains the ids of interest that are present in the HDF5 file and wish to output the corresponding data of those rows - all corresponding data are numerals.
I wrote the following code but the output only contains the ids (single column). I need the remaining data attached to those rows (the data in the subsequent columns). Any advice would be much appreciated!
import os
import h5py
mydir = os.path.expanduser("~/Desktop/alexs-stuff/")
in_file = mydir + "EMP/EMPopen/full_emp_table_hdf5.h5"
wanted_file = mydir + "EMP/greengenes-curto-only.txt"
out_file = mydir + "EMP/emp-curto-only.txt"
wanted = set()
with open(wanted_file) as f:
for line in f:
line = line.strip()
if line != "":
wanted.add(line)
hdf5_file = h5py.File(in_file, "r")
count = 0
with open(out_file, "w") as h:
for keys in hdf5_file["observation"]["ids"]:
if keys in wanted:
count = count + 1
h.write(keys + "\n")
print "Converted %i records" % count
hdf5_file.close()
If it helps, here is the structure of the hdf5 file:
<HDF5 file "full_emp_table_hdf5.h5" (mode r)> (File) /
sample /sample (Group) /sample
metadata /sample/metadata (Group) /sample/metadata
group-metadata /sample/group-metadata (Group) /sample/group-metadata
ids /sample/ids (Dataset) /sample/ids len = (15481,) object
matrix /sample/matrix (Group) /sample/matrix
indices /sample/matrix/indices (Dataset) /sample/matrix/indices len = (107439386,) int32
indptr /sample/matrix/indptr (Dataset) /sample/matrix/indptr len = (15482,) int32
data /sample/matrix/data (Dataset) /sample/matrix/data len = (107439386,) float64
observation /observation (Group) /observation
metadata /observation/metadata (Group) /observation/metadata
taxonomy /observation/metadata/taxonomy (Dataset) /observation/metadata/taxonomy len = (5594412, 7) object
group-metadata /observation/group-metadata (Group) /observation/group-metadata
ids /observation/ids (Dataset) /observation/ids len = (5594412,) object
matrix /observation/matrix (Group) /observation/matrix
indices /observation/matrix/indices (Dataset) /observation/matrix/indices len = (107439386,) int32
indptr /observation/matrix/indptr (Dataset) /observation/matrix/indptr len = (5594413,) int32
data /observation/matrix/data (Dataset) /observation/matrix/data len = (107439386,) float64
additional information:
type(hdf5_file['observation']['ids'])
>>> <class 'h5py._hl.dataset.Dataset'>
dir(hdf5_file['observation']['ids'])
>>> ['__array__', '__class__', '__delattr__', '__dict__', '__doc__', '__eq__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__iter__', '__len__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_d', '_dcpl', '_e', '_filters', '_id', '_lapl', '_lcpl', '_local', 'astype', 'attrs', 'chunks', 'compression', 'compression_opts', 'dims', 'dtype', 'file', 'fillvalue', 'fletcher32', 'id', 'len', 'maxshape', 'name', 'parent', 'read_direct', 'ref', 'regionref', 'resize', 'scaleoffset', 'shape', 'shuffle', 'size', 'value', 'write_direct']
Upvotes: 3
Views: 9337
Reputation: 5048
This error occurred for me because I was trying to index into a h5py.Dataset
.
This usually happens for one of two reasons:
h5py.Group
, and really it was a h5py.Dataset
. You can check this by merely printing the type
of the thing you're trying to index before indexing it to make sure it's an h5py.Group
and not an h5py.Dataset
. (i.e. print(type(thing))
before doing thing["key"]
. You want it to say h5py.Group
, not h5py.Dataset
.)dataset[:]
on your h5py.Dataset
to get its numpy array or dataset[()]
to get its scalar value.Upvotes: 0
Reputation: 45
I did not find an easy answer using python, but if anyone else comes across a similar problem, use:
$biom subset-table -i input_file.biom -a observation -s ids_wanted.txt -o biom
you can then convert this:
$biom convert -i outputfilefromabove.biom -o outputfilefromabove.txt --to_tsv
This method is way easier and takes advantage of the great resources at biom!
Upvotes: 0
Reputation: 231738
I think this hdf5
file was produced by biom-format
software, http://biom-format.org
, From its pypi
package: https://pypi.python.org/packages/source/b/biom-format/biom-format-2.1.2.tar.gz
I can extract a sample hdf5
file: examples/min_sparse_otu_table_hdf5.biom
, and process it with:
import numpy as np
from scipy import sparse as sp
import h5py
f = h5py.File('min_sparse_otu_table_hdf5.biom','r')
f.visit(print) # to see its structure
"""
observation
observation/group-metadata
observation/ids
observation/matrix
observation/matrix/data
observation/matrix/indices
observation/matrix/indptr
observation/metadata
sample
sample/group-metadata
sample/ids
sample/matrix
sample/matrix/data
sample/matrix/indices
sample/matrix/indptr
sample/metadata
"""
Or in more detail:
f.visititems(lambda name,obj:print(name, obj))
"""
...
sample <HDF5 group "/sample" (4 members)>
sample/group-metadata <HDF5 group "/sample/group-metadata" (0 members)>
sample/ids <HDF5 dataset "ids": shape (6,), type "|O4">
sample/matrix <HDF5 group "/sample/matrix" (3 members)>
sample/matrix/data <HDF5 dataset "data": shape (15,), type "<f8">
sample/matrix/indices <HDF5 dataset "indices": shape (15,), type "<i4">
sample/matrix/indptr <HDF5 dataset "indptr": shape (7,), type "<i4">
sample/metadata <HDF5 group "/sample/metadata" (0 members)>
"""
This has a similar structure to your full_emp_table_hdf5.h5
Load arrays with .value
(or [:]
)
sample_ids = f['sample/ids'].value
"""
array([b'Sample1', b'Sample2', b'Sample3', b'Sample4', b'Sample5',
b'Sample6'], dtype=object)
"""
# create a sparse array from the sample/matrix group
data = f['sample/matrix/data'].value
indices = f['sample/matrix/indices'].value
indptr = f['sample/matrix/indptr'].value
sample = sp.csr_matrix((data,indices,indptr))
# display as a dense array
print(sample.A)
"""
array([[ 0., 5., 0., 2., 0.],
[ 0., 1., 0., 1., 1.],
[ 1., 0., 1., 1., 1.],
[ 0., 2., 4., 0., 0.],
[ 0., 3., 0., 0., 0.],
[ 0., 1., 2., 1., 0.]])
"""
# select a specific row
print(sample[2,:])
"""
(0, 0) 1.0
(0, 2) 1.0
(0, 3) 1.0
(0, 4) 1.0
"""
print(sample[2,:].A) # that row in dense format
# array([[ 1., 0., 1., 1., 1.]])
I can write sample
to a text file with:
np.savetxt('min_sample.txt',sample.A, '%.2f')
"""
0.00 5.00 0.00 2.00 0.00
0.00 1.00 0.00 1.00 1.00
1.00 0.00 1.00 1.00 1.00
0.00 2.00 4.00 0.00 0.00
0.00 3.00 0.00 0.00 0.00
0.00 1.00 2.00 1.00 0.00
"""
The biom
module probably can do all this and more, but the basics just need the numpy
and sparse
modules.
Upvotes: 1
Reputation: 1822
Change the line:
h.write(keys + "\n")
to this:
h.write(hdf5_file['observation']['ids'][keys] + "\n")
Upvotes: 0