Rafaella Chiarella

Reputation: 35

Allow_Pickle = True modified my dictionary to "unsized" when loaded

I am trying to save and load variables (dictionaries) to use in other notebooks. I save the variables with:

with open('opp2b.npy', 'wb') as f:
    np.save(f, mak)
    np.save(f, mp)
    
len(mak)
82

mak and mp are dictionaries, each with 82 entries of the same length. When loading, if I do not use allow_pickle=True, it will not load, so I use this:

with open('opp2b.npy', 'rb') as f:
    mak = np.load(f, allow_pickle=True)
    mp = np.load(f, allow_pickle=True)

and when I check the length I get

len(mak)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-16-bb967ce1f5ef> in <module>
----> 1 len(mak)

TypeError: len() of unsized object

I am not sure why the array is modified, but it is now unusable for what I need.

Upvotes: 0

Views: 238

Answers (1)

ShadowRanger

Reputation: 155506

Per your comments, mak is not a numpy array at all. numpy.save is specifically documented to:

Save an array to a binary file in NumPy .npy format.

allow_pickle is for numpy arrays containing Python objects, but the .npy format is not intended to store things that aren't numpy arrays at all. To successfully store the dict, it's wrapping it in a 0D numpy "array", and that's what np.load is giving you. You could extract the original dict by doing:

mak = mak.item(0)  # mak = mak[0] doesn't work because a 0D array has no axes
                   # to index; .item(0) (or .item() with no argument) pulls out
                   # the single Python object the 0D array wraps. Indexing with
                   # an empty tuple, mak[()], also works.
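To see the wrapping in action, here is a minimal round-trip sketch using an in-memory buffer instead of a file (the dict contents are illustrative):

```python
import io

import numpy as np

d = {'a': 1, 'b': 2}

buf = io.BytesIO()
np.save(buf, d)              # numpy wraps the dict in a 0D object array
buf.seek(0)

arr = np.load(buf, allow_pickle=True)
print(type(arr), arr.shape)  # a numpy.ndarray with shape (), not a dict
print(arr.item() == d)       # .item() recovers the original dict
```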

But really, that's trying to put a square peg in a round hole. If you're not storing numpy arrays, there's little benefit to the .npy format, if any. The main advantages it provides are:

  1. Avoiding arbitrary code execution for untrusted inputs (since you need to allow_pickle, that advantage goes away)
  2. Allowing you to memory map on load (irrelevant when the entire data structure must be pickled anyway; memory mapping helps only for C level data where you might benefit from lazy loading and better performance if RAM grows short, as the data need not be written to swap before the pages are reclaimed)
  3. (No longer relevant on modern Python) Stores array data more efficiently than the old pickle protocol 0 (that produced legal ASCII output, meaning only bytes of 127 or below, which made pickling raw binary data inefficient). As long as you're using protocol 2 or higher (which is binary, handles new-style classes efficiently, and is supported back to Python 2.3), it should store your data efficiently. As of Python 3.0, the default protocol is protocol 3 (rising to protocol 4 in 3.8), so if you're using a supported version of Python, and don't specify the protocol, it will use 3 or 4 (both of which work fine; protocol 4 being better if you're pickling huge objects).
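As a rough sanity check on point 3, you can compare the serialized sizes directly (in-memory buffers; exact byte counts will vary with your numpy and pickle versions):

```python
import io
import pickle

import numpy as np

data = {'a': np.array([0, 1, 2])}

buf = io.BytesIO()
np.save(buf, data)                  # .npy header + pickled 0D wrapper around the dict
npy_size = buf.getbuffer().nbytes

pkl_size = len(pickle.dumps(data))  # default protocol, the dict pickled directly

print(npy_size, pkl_size)           # the .npy dump comes out noticeably larger
```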

Since you aren't storing numpy arrays, just rely on the pickle module directly to store arbitrary data. With modern pickle protocols, which allow efficient binary storage, even numpy arrays store efficiently enough that the .npy format isn't helping much, if at all; for some trivial test cases I tried, saving {'a': numpy.array([0,1,2])}, the .npy dump was over twice the size.

import pickle  # At top of file

with open('opp2b.pkl', 'wb') as f:  # Name with common pickle extension instead of .npy
    pickle.dump(mak, f)  # Argument order reversed from np.save
    pickle.dump(mp, f)

and then to load:

with open('opp2b.pkl', 'rb') as f:  # Matching change in name
    mak = pickle.load(f)
    mp = pickle.load(f)

This assumes you might in fact want to load only one data set or the other at a time; if you plan to store and load both all the time, you may as well condense it to a single write of a tuple of the relevant values (increasing the chance that duplicated objects across the two objects can use back-references to avoid reserializing the same data multiple times), e.g.:

with open('opp2b.pkl', 'wb') as f:
    pickle.dump((mak, mp), f)

and:

with open('opp2b.pkl', 'rb') as f:
    mak, mp = pickle.load(f)

Upvotes: 3
