Reputation: 32081
I am deserializing large numpy arrays (500MB in this example) and I find the results vary by orders of magnitude between approaches. Below are the 3 approaches I've timed.
I'm receiving the data from the multiprocessing.shared_memory package, so the data comes to me as a memoryview object. But in these simple examples, I just pre-create a byte array to run the test.
I wonder if there are any mistakes in these approaches, or if there are other techniques I didn't try. Deserialization in Python is a real pickle of a problem if you want to move data fast and not lock the GIL just for the IO. A good explanation as to why these approaches vary so much would also be a good answer.
""" Deserialization speed test """
import numpy as np
import pickle
import time
import io
sz = 524288000
sample = np.random.randint(0, 255, size=sz, dtype=np.uint8) # 500 MB data
serialized_sample = pickle.dumps(sample)
serialized_bytes = sample.tobytes()
serialized_bytesio = io.BytesIO()
np.save(serialized_bytesio, sample, allow_pickle=False)
serialized_bytesio.seek(0)
result = None
print('Deserialize using pickle...')
t0 = time.time()
result = pickle.loads(serialized_sample)
print('Time: {:.10f} sec'.format(time.time() - t0))
print('Deserialize from bytes...')
t0 = time.time()
# np.ndarray(buffer=...) wraps the existing bytes as a view; the data is not copied
result = np.ndarray(shape=sz, dtype=np.uint8, buffer=serialized_bytes)
print('Time: {:.10f} sec'.format(time.time() - t0))
print('Deserialize using numpy load from BytesIO...')
t0 = time.time()
result = np.load(serialized_bytesio, allow_pickle=False)
print('Time: {:.10f} sec'.format(time.time() - t0))
Results:
Deserialize using pickle...
Time: 0.2509949207 sec
Deserialize from bytes...
Time: 0.0204288960 sec
Deserialize using numpy load from BytesIO...
Time: 28.9850852489 sec
The second option is the fastest, but notably less elegant because I need to explicitly serialize the shape and dtype information.
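For reference, here is a minimal sketch of what that explicit bookkeeping could look like. The layout (a 4-byte little-endian header length, a JSON header holding dtype and shape, then the raw array bytes) is an assumption made up for illustration, and pack_array and unpack_array are hypothetical helpers, not part of NumPy. It only handles plain numeric dtypes.

```python
import json
import struct

import numpy as np


def pack_array(arr):
    """Serialize arr as: 4-byte header length, JSON header, raw array bytes."""
    arr = np.ascontiguousarray(arr)
    header = json.dumps({'dtype': arr.dtype.str, 'shape': arr.shape}).encode('utf-8')
    return struct.pack('<I', len(header)) + header + arr.tobytes()


def unpack_array(buf):
    """Rebuild the array as a view over buf, without copying the payload."""
    view = memoryview(buf)
    (header_len,) = struct.unpack_from('<I', view, 0)
    header = json.loads(bytes(view[4:4 + header_len]).decode('utf-8'))
    count = int(np.prod(header['shape'], dtype=np.int64))
    data = np.frombuffer(view, dtype=np.dtype(header['dtype']),
                         count=count, offset=4 + header_len)
    return data.reshape(header['shape'])


# Round trip on a small array (hypothetical example data)
original = np.arange(12, dtype=np.int32).reshape(3, 4)
restored = unpack_array(pack_array(original))
print(restored.dtype, restored.shape)
```

Because unpack_array accepts any buffer object, it should also work on a memoryview such as the one exposed by shared memory, though I have only sketched it against bytes here. The returned array is read-only when the underlying buffer is immutable.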
Upvotes: 11
Views: 3504
Reputation: 1126
I found your question useful. I'm looking for the best NumPy serialization approach, and in my run np.load() was faster than pickle, but both were beaten by pyarrow in my add-on test below. Arrow is now a very popular data serialization framework for distributed compute (e.g. Spark, ...)
""" Deserialization speed test """
import numpy as np
import pickle
import time
import io
import pyarrow as pa
sz = 524288000
sample = np.random.randint(0, 255, size=sz, dtype=np.uint8) # 500 MB data
# Note: pa.serialize / pa.deserialize are deprecated as of pyarrow 2.0; this run used 1.0.1
pa_buf = pa.serialize(sample).to_buffer()
serialized_sample = pickle.dumps(sample)
serialized_bytes = sample.tobytes()
serialized_bytesio = io.BytesIO()
np.save(serialized_bytesio, sample, allow_pickle=False)
serialized_bytesio.seek(0)
result = None
print('Deserialize using pickle...')
t0 = time.time()
result = pickle.loads(serialized_sample)
print('Time: {:.10f} sec'.format(time.time() - t0))
print('Deserialize from bytes...')
t0 = time.time()
# np.ndarray(buffer=...) wraps the existing bytes as a view; the data is not copied
result = np.ndarray(shape=sz, dtype=np.uint8, buffer=serialized_bytes)
print('Time: {:.10f} sec'.format(time.time() - t0))
print('Deserialize using numpy load from BytesIO...')
t0 = time.time()
result = np.load(serialized_bytesio, allow_pickle=False)
print('Time: {:.10f} sec'.format(time.time() - t0))
print('Deserialize pyarrow')
t0 = time.time()
restored_data = pa.deserialize(pa_buf)
print('Time: {:.10f} sec'.format(time.time() - t0))
Results from an i3.2xlarge on Databricks Runtime 8.3 ML (Python 3.8, NumPy 1.19.2, PyArrow 1.0.1):
Deserialize using pickle...
Time: 0.4069395065 sec
Deserialize from bytes...
Time: 0.0281322002 sec
Deserialize using numpy load from BytesIO...
Time: 0.3059172630 sec
Deserialize pyarrow
Time: 0.0031735897 sec
Your BytesIO result was about 100x slower than mine, and I don't know why.
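One way to narrow down a gap like this might be to repeat the measurement and record the NumPy version alongside it, since a single time.time() sample can be noisy. Below is a rough sketch, assuming a much smaller array so it runs quickly; best_of and load_from_bytesio are hypothetical helper names, and this does not by itself explain the difference.

```python
import io
import time

import numpy as np


def best_of(fn, repeats=5):
    """Return the fastest wall-clock time (seconds) over several runs of fn()."""
    timings = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - t0)
    return min(timings)


# Small array so the sketch runs quickly; raise sz to reproduce the 500 MB case.
sz = 1_000_000
sample = np.random.randint(0, 255, size=sz, dtype=np.uint8)

buf = io.BytesIO()
np.save(buf, sample, allow_pickle=False)
payload = buf.getvalue()


def load_from_bytesio():
    # Build a fresh BytesIO each run so every load starts at offset 0.
    return np.load(io.BytesIO(payload), allow_pickle=False)


elapsed = best_of(load_from_bytesio)
print('np.load from BytesIO, best of 5: {:.6f} sec'.format(elapsed))
print('numpy version:', np.__version__)
```

Running the same script on both machines would at least show whether the gap tracks the NumPy version or the hardware.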
Upvotes: 3