4a616e

Reputation: 33

Loading numpy arrays stored in npz archive in PySpark

I have a large number of NumPy arrays stored in S3 as npz archives. What is the best way to load them into a PySpark RDD/DataFrame of NumPy arrays? I have tried to load the files using the sc.wholeTextFiles API.

rdd = sc.wholeTextFiles("s3://[bucket]/[folder_containing_npz_files]")

However, numpy.load requires a filename or an open file handle, and loading each file's contents into memory as a single string takes up a lot of memory.
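For reference, the archives were created along these lines (the filename and array names here are just illustrative):

import numpy as np

# Illustrative only: an npz archive is a zip of named arrays,
# typically written with np.savez or np.savez_compressed.
np.savez("example.npz", a=np.arange(10), b=np.ones((3, 3)))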

Upvotes: 2

Views: 2393

Answers (1)

zero323

Reputation: 330373

You cannot do much about the memory requirements, but otherwise BytesIO should work just fine:

import numpy as np
from io import BytesIO

def extract(kv):
    # kv is a (path, bytes) pair produced by binaryFiles
    k, v = kv
    # Wrap the raw bytes in a file-like object so np.load can open the archive
    with BytesIO(v) as r:
        for f, x in np.load(r).items():
            # Key each array by "<file path>\t<array name>"
            yield "{0}\t{1}".format(k, f), x

sc.binaryFiles(inputPath).flatMap(extract)
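
If you need a DataFrame rather than an RDD, one option is to map each pair to a Row. This is only a sketch: the column names are mine, it assumes an active SparkSession, and it assumes each array is small enough to be represented as a plain Python list:

from pyspark.sql import Row

rdd = sc.binaryFiles(inputPath).flatMap(extract)

# Sketch only: convert each (key, array) pair to a Row, turning the
# NumPy array into a plain Python list so Spark can infer a schema.
df = rdd.map(lambda kv: Row(key=kv[0], values=kv[1].tolist())).toDF()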

Upvotes: 2
