Reputation: 33
I have a large number of NumPy arrays stored in S3 as npz archives. What is the best way to load them into a PySpark RDD/DataFrame of NumPy arrays? I have tried to load the files using the sc.wholeTextFiles API.
rdd = sc.wholeTextFiles("s3://[bucket]/[folder_containing_npz_files]")
However, numpy.load requires a file handle, and loading the file contents into memory as a string takes up a lot of memory.
Upvotes: 2
Views: 2393
Reputation: 330373
You cannot do much about the memory requirements, but otherwise BytesIO should work just fine:
from io import BytesIO
import numpy as np

def extract(kv):
    # kv is a (path, bytes) pair produced by sc.binaryFiles
    k, v = kv
    with BytesIO(v) as r:
        # np.load on an .npz archive yields (array_name, ndarray) pairs
        for f, x in np.load(r).items():
            yield "{0}\t{1}".format(k, f), x

# inputPath is the S3 folder containing the .npz files
sc.binaryFiles(inputPath).flatMap(extract)
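
For reference, a minimal usage sketch of the resulting RDD of (key, ndarray) pairs; it assumes an active SparkSession (so toDF is available) and simple 1-D numeric arrays, and the column names are only illustrative:

rdd = sc.binaryFiles("s3://[bucket]/[folder_containing_npz_files]").flatMap(extract)

# Inspect a few (key, ndarray) pairs
for key, arr in rdd.take(3):
    print(key, arr.shape, arr.dtype)

# Spark DataFrames cannot hold NumPy arrays directly, so convert each
# array to a plain Python list before building a DataFrame
df = rdd.map(lambda kv: (kv[0], kv[1].tolist())).toDF(["name", "values"])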
Upvotes: 2