4a616e

Reputation: 33

Loading numpy arrays stored in npz archive in PySpark

I have a large number of NumPy arrays stored in S3 as npz archives. What is the best way to load them into a PySpark RDD/DataFrame of NumPy arrays? I have tried to load the files using the sc.wholeTextFiles API.

rdd = sc.wholeTextFiles("s3://[bucket]/[folder_containing_npz_files]")

However, numpy.load requires a filename or an open file handle, and loading each file's contents into memory as a single string takes up a lot of memory.
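For reference, the archives were created along these lines (the filename and array names here are just illustrative):

import numpy as np

# Illustrative only: an npz archive is a zip of named arrays,
# typically written with np.savez or np.savez_compressed.
np.savez("example.npz", a=np.arange(10), b=np.ones((3, 3)))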

Upvotes: 2

Views: 2393

Answers (1)

zero323

Reputation: 330373

You cannot do much about the memory requirements, but otherwise BytesIO should work just fine:

import numpy as np
from io import BytesIO

def extract(kv):
    # kv is a (path, bytes) pair produced by binaryFiles
    k, v = kv
    # Wrap the raw bytes in a file-like object so np.load can open the archive
    with BytesIO(v) as r:
        for f, x in np.load(r).items():
            # Key each array by "<file path>\t<array name>"
            yield "{0}\t{1}".format(k, f), x

sc.binaryFiles(inputPath).flatMap(extract)
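
If you need a DataFrame rather than an RDD, one option is to map each pair to a Row. This is only a sketch: the column names are mine, it assumes an active SparkSession, and it assumes each array is small enough to be represented as a plain Python list:

from pyspark.sql import Row

rdd = sc.binaryFiles(inputPath).flatMap(extract)

# Sketch only: convert each (key, array) pair to a Row, turning the
# NumPy array into a plain Python list so Spark can infer a schema.
df = rdd.map(lambda kv: Row(key=kv[0], values=kv[1].tolist())).toDF()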

Upvotes: 2
