Suppose that I am working with a very large array (e.g., ~45GB) and am trying to pass it through a function which only accepts numpy arrays. What is the best way to:
I know nothing about hidden Markov models, but as for numpy memmaps, you may find it'll just work. I say this because np.memmap is a direct subclass of ndarray. That said, even in the documentation, it is stated that it "does not quite fit the ndarray subclass" and suggests it is possible to create the mmap object yourself with mmap.mmap(...). IMO, after looking at the numpy.memmap.__new__() function, there's not much more you could do to make it a drop-in replacement, in which case you'll have to take a look at the functions you want to use and work out why memmap arrays are not playing nice. If that happens, it may even be easier to alter those files than to alter the way mmap is applied.
As a final note, when working directly from disk (even buffered), get ready for some slow computation times. I would suggest finding the appropriate source code and hacking in a progress indicator for the computationally expensive parts. Incremental writeback can also save you from re-computing large portions of data if an error (or just a power outage) occurs.
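As a rough illustration of incremental writeback, here is a sketch that processes a memmapped array in row chunks, flushes each finished chunk to disk, and records the last completed row in a small text file so a restart can pick up from there. The function name process_in_chunks, the file names, the chunk size, and the squaring step are all placeholders of mine, not anything from hmmlearn.

```python
import os
import tempfile

import numpy as np


def process_in_chunks(src, dst, state_path, chunk_rows=256):
    """Square `src` into `dst` chunk by chunk, resuming from a saved row offset."""
    # Resume from the last checkpoint if one exists (hypothetical text file).
    start = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            start = int(f.read().strip() or 0)

    n_rows = src.shape[0]
    for lo in range(start, n_rows, chunk_rows):
        hi = min(lo + chunk_rows, n_rows)
        dst[lo:hi] = src[lo:hi] ** 2  # stand-in for the expensive step
        dst.flush()                   # push the finished rows to disk
        with open(state_path, "w") as f:
            f.write(str(hi))          # record how far we got
        print("completion: {}%".format(int(100.0 * hi / n_rows)))
    return n_rows


# Small demo arrays; the real input would be the existing large file.
workdir = tempfile.mkdtemp()
src = np.memmap(os.path.join(workdir, "in.dat"), dtype=np.float64,
                mode="w+", shape=(1000, 4))
src[:] = np.arange(4000, dtype=np.float64).reshape(1000, 4)
dst = np.memmap(os.path.join(workdir, "out.dat"), dtype=np.float64,
                mode="w+", shape=(1000, 4))
state = os.path.join(workdir, "progress.txt")
rows_done = process_in_chunks(src, dst, state)
```

On a restart you would reopen the output file with mode="r+" rather than "w+", since "w+" creates or overwrites the file and would discard the rows already written.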
Here's an example of how I might add progress reporting to GaussianHMM().fit(). Changes to hmmlearn\base.py:
class _BaseHMM(BaseEstimator):
    # ...
    def fit(self, X, lengths=None):
        # ...
        for iter in range(self.n_iter):
            stats = self._initialize_sufficient_statistics()
            curr_logprob = 0
            for i, j in iter_from_X_lengths(X, lengths, iter, self.n_iter):  # tell our generator which iteration
                # ...
                pass
Changes to hmmlearn\utils.py:
import numpy as np

def iter_from_X_lengths(X, lengths, iteration, stop):
    if lengths is None:
        yield 0, len(X)
        # all of X is a single sequence, so this iteration is now finished
        print("completion: {}%".format(int(100.0 * (iteration + 1) / stop)))
    else:
        length = len(lengths)  # used every loop, so I copied it to a local var
        n_samples = X.shape[0]
        end = np.cumsum(lengths).astype(np.int32)
        start = end - lengths
        if end[-1] > n_samples:
            raise ValueError("more than {0:d} samples in lengths array {1!s}"
                             .format(n_samples, lengths))
        for i in range(length):
            yield start[i], end[i]
            # convert loop iterations to % completion; multiply by 100 before
            # int(), otherwise the fraction is truncated to 0 every time
            print("completion: {}%".format(
                int((float(iteration) / stop + float(i) / length / stop) * 100)))