Andy

Reputation: 185

Python: passing memmap array through function?

Suppose that I am working with a very large array (e.g., ~45 GB) and am trying to pass it to a function that only accepts numpy arrays. What is the best way to:

  1. Store this array given limited memory?
  2. Pass this stored array into a function that takes only numpy arrays?

Upvotes: 3

Views: 797

Answers (1)

Aaron

Reputation: 11075

TLDR; just try it...

I know nothing about hidden Markov models, but as for numpy memmaps, you may find it'll just work. I say this because np.memmap is a direct subclass of ndarray. That said, even the documentation states that it "does not quite fit the ndarray subclass" and suggests creating the mmap object yourself with mmap.mmap(...). IMAO, after looking at the numpy.memmap.__new__() function, there's not much more you could do to make it a drop-in replacement, in which case you'll have to take a look at the functions you want to use and work out why memmap arrays aren't playing nice. If that happens, it may even be easier to alter those source files than to alter the way the memmap is applied.
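
To make "just try it" concrete, here's a minimal sketch of both halves of the question, assuming GaussianHMM (from the hmmlearn package discussed below) is the target function; the file name, shape, dtype and model parameters are all placeholders:

import numpy as np
from hmmlearn.hmm import GaussianHMM

# Store: write the big array to disk once as a memmap (file name, dtype and
# shape are placeholders for whatever your ~45 GB of observations actually is).
n_samples, n_features = 1000000, 10
X = np.memmap("observations.dat", dtype=np.float64, mode="w+",
              shape=(n_samples, n_features))
# ... fill X in manageable chunks, e.g. X[i:j] = chunk_of_data ...
X.flush()

# Pass: reopen read-only and hand it over as though it were a plain ndarray.
X = np.memmap("observations.dat", dtype=np.float64, mode="r",
              shape=(n_samples, n_features))
print(isinstance(X, np.ndarray))  # True -- memmap is a direct subclass

model = GaussianHMM(n_components=3, covariance_type="diag", n_iter=10)
model.fit(X)  # may just work; if it complains, look at what fit() does with X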

As a final note, when working directly from disk (even buffered), get ready for some slow computation times... I would suggest finding the appropriate source code and hacking in a progress indication for the computationally expensive sections. Incremental writeback can also save you from re-computing large chunks of data if an error (or just a power outage) occurs.
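
For the incremental-writeback idea, one rough pattern (nothing hmmlearn-specific; the file names, chunk size and the computation itself are stand-ins) is to stream results into a second memmap and flush each finished chunk to disk:

import numpy as np

n_samples, n_features = 1000000, 10
X = np.memmap("observations.dat", dtype=np.float64, mode="r",
              shape=(n_samples, n_features))

# Results go into another memmap so every finished chunk is already on disk.
out = np.memmap("results.dat", dtype=np.float64, mode="w+", shape=(n_samples,))

chunk = 100000
for i in range(0, n_samples, chunk):
    j = min(i + chunk, n_samples)
    out[i:j] = X[i:j].sum(axis=1)  # stand-in for the expensive computation
    out.flush()                    # writeback now; a crash only costs the current chunk
    print("completion: {}%".format(int(100.0 * j / n_samples)))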

Here's an example of how I might add progress reporting to GaussianHMM().fit():

additions are marked with "# added" comments

changes to hmmlearn\base.py:

class _BaseHMM(BaseEstimator):
    # ...
    def fit(self, X, lengths=None):
        # ...
        for iter in range(self.n_iter):
            stats = self._initialize_sufficient_statistics()
            curr_logprob = 0
            for i, j in iter_from_X_lengths(X, lengths, iter, self.n_iter):  # added: tell our generator which iteration we're on
                # ...
                pass

changes to hmmlearn\utils.py:

def iter_from_X_lengths(X, lengths, iteration, stop):  # added: current iteration and total iteration count
    if lengths is None:
        yield 0, len(X)
        # added: one segment per iteration, so report overall progress across iterations
        print("completion: {}%".format(int(100.0 * (iteration + 1) / stop)))
    else:
        length = len(lengths)  # added: used every loop, so copy it to a local var
        n_samples = X.shape[0]
        end = np.cumsum(lengths).astype(np.int32)
        start = end - lengths
        if end[-1] > n_samples:
            raise ValueError("more than {0:d} samples in lengths array {1!s}"
                             .format(n_samples, lengths))

        for i in range(length):
            yield start[i], end[i]
            # added: convert loop iterations to overall % completion across all EM iterations
            print("completion: {}%".format(
                int((float(iteration) / stop + float(i + 1) / (length * stop)) * 100)))

Upvotes: 1
