j.arthur5

Reputation: 13

Reading hdf5 file quickly with cython and h5py

I'm trying to speed up a python3 function that takes some data, which is an array of indexes, and saves them if they meet a certain criterion. I have tried to speed it up by using "cython -a script.py", but the bottleneck seems to be the h5py I/O when slicing datasets.

I'm relatively new to cython, so I was wondering whether there is any way to speed this up, or am I just limited by the h5py I/O here?

Here is the function I'm trying to improve:

import numpy as np
import h5py

cimport numpy as np
cimport cython
from libc.math cimport sqrt

DTYPE64 = np.int64
ctypedef np.int64_t DTYPE64_t
DTYPE32 = np.int32
ctypedef np.int32_t DTYPE32_t

@cython.boundscheck(False)
@cython.wraparound(False)
def tag_subhalo_branch(np.ndarray[DTYPE64_t] halos_z0_treeindxs,
                       np.ndarray[DTYPE64_t] tree_pindx,
                       np.ndarray[DTYPE32_t] tree_psnapnum,
                       np.ndarray[DTYPE64_t] tree_psnapid,
                       np.ndarray[DTYPE64_t] tree_hsnapid, hf,
                       int size):

    cdef int i
    cdef double radial, progen_x, progen_y, progen_z
    cdef double host_x, host_y, host_z, host_rvir
    cdef DTYPE64_t progen_indx, progen_haloid, host_id
    cdef DTYPE32_t progen_snap
    cdef int j = 0
    cdef int size_array = size
    cdef np.ndarray[DTYPE64_t] backsplash_ids = np.zeros(size_array,
                                                         dtype=DTYPE64)


    for i in range(0, size_array):
        progen_indx = tree_pindx[halos_z0_treeindxs[i]]
        if progen_indx != -1:
            progen_snap = tree_psnapnum[progen_indx]
            progen_haloid = tree_psnapid[progen_indx]

            while progen_indx != -1 and progen_snap != -1:
                # ** This is slow **
                grp = hf['Snapshots/snap_' + str('%03d' % progen_snap) + '/']
                host_id = grp['HaloCatalog'][(progen_haloid - 1), 2]
                # **

                if host_id != -1:
                    # ** This is slow **
                    progen_x = grp['HaloCatalog'][(progen_haloid - 1), 6]
                    host_x = grp['HaloCatalog'][(host_id - 1), 6]
                    progen_y = grp['HaloCatalog'][(progen_haloid - 1), 7]
                    host_y = grp['HaloCatalog'][(host_id - 1), 7]
                    progen_z = grp['HaloCatalog'][(progen_haloid - 1), 8]
                    host_z = grp['HaloCatalog'][(host_id - 1), 8]
                    # **
                    radial = 0
                    radial += (progen_x - host_x)**2
                    radial += (progen_y - host_y)**2
                    radial += (progen_z - host_z)**2
                    radial = sqrt(radial)

                    host_rvir = grp['HaloCatalog'][(host_id - 1), 24]
                    if radial <= host_rvir:
                        backsplash_ids[j] = tree_hsnapid[
                            halos_z0_treeindxs[i]]
                        j += 1
                        break

                # Find next progenitor information
                progen_indx = tree_pindx[progen_indx]
                progen_snap = tree_psnapnum[progen_indx]
                progen_haloid = tree_psnapid[progen_indx]
    return backsplash_ids

Upvotes: 0

Views: 1368

Answers (1)

hpaulj

Reputation: 231385

As described here: http://api.h5py.org/, h5py uses cython code to interface with the HDF5 C code. So your own cython code might be able to access that directly. But I suspect that will require a lot more study.

Your code is using the Python interface to h5py, and cythonizing isn't going to touch that.

cython code is best used for low level actions, especially iterative things that can't be expressed as array operations. Study and experiment with the numpy examples first. You are diving into cython at the deep end of the pool.

Have you tried to improve that code just with Python and numpy? Just at a glance I'm seeing a lot of redundant h5py calls.

====================

Your radial calculation accesses the h5py indexing 6 times when it could get by with 2. Maybe you wrote it that way in hopes that cython would perform the following calculation faster than numpy?

data = grp['HaloCatalog']
progen = data[progen_haloid - 1, 6:9]
host = data[host_id - 1, 6:9]
radial = np.sqrt(((progen - host)**2).sum())

Why not load all of data[progen_haloid-1,:] and data[host_id-1,:]? Or even all of data? I'd have to review when h5py switches from working directly with the arrays on the file to returning numpy arrays. In any case, math on arrays in memory will be a lot faster than file reads.
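For illustration, here is a minimal, self-contained sketch of that idea: read each halo's whole row once, then do all the math on in-memory numpy arrays. The file layout (group name, column meanings, row count) is a made-up stand-in for the question's catalog, and the file itself is created in memory just so the example runs anywhere.

```python
import numpy as np
import h5py

# Hypothetical in-memory file mimicking the question's layout
# ('Snapshots/snap_NNN/HaloCatalog'); the column meanings (6:9 =
# position, 24 = rvir) are assumptions carried over from the question.
with h5py.File('demo.h5', 'w', driver='core', backing_store=False) as hf:
    hf.create_dataset('Snapshots/snap_000/HaloCatalog',
                      data=np.arange(100.0).reshape(4, 25))

    data = hf['Snapshots/snap_000/HaloCatalog']

    # One file read per halo (whole row -> numpy array in memory),
    # instead of one read per column as in the original loop.
    progen = data[0, :]
    host = data[1, :]

    # All further math happens on arrays in memory, not on the file.
    radial = np.sqrt(((progen[6:9] - host[6:9]) ** 2).sum())
    host_rvir = host[24]
    print(radial <= host_rvir)  # → True for this toy data
```

The same pattern extends to caching `data[...]` (or the whole dataset, if it fits in RAM) outside the `while` loop, so each snapshot's catalog is read from disk only once.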

Upvotes: 2
