Reputation: 1480
The use case is the following:
I've been trying to do step two in memory (without having to store the file to disk in order to get the parquet format), but all the libraries I've seen so far always write to disk.
So I have the following questions:
Upvotes: 13
Views: 12983
Reputation: 8137
Apache Arrow and the pyarrow library should solve this, since they do much of the processing in memory. In pandas you can read and write parquet files via pyarrow.
Some example code that also leverages smart_open:
import pandas as pd
import boto3
from smart_open import open  # note: this shadows the built-in open()
from io import BytesIO

s3 = boto3.client('s3')

# bucket, key, outputBucket, outputPrefix and additionalSuffix are placeholders to define yourself

# read parquet file into memory
obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_parquet(BytesIO(obj['Body'].read()), engine='pyarrow')

# do stuff with dataframe

# write parquet file to s3 straight from memory
with open(f's3://{outputBucket}/{outputPrefix}{additionalSuffix}', 'wb') as out_file:
    df.to_parquet(out_file, engine='pyarrow', index=False)
Upvotes: 12
Reputation: 16007
Wouldn't it be more performant if the conversion was done in-memory since you don't have to deal with I/O disk overhead?
Yes, it would. For that you could use a BytesIO object (or StringIO for text data), which can be used in place of a file object. If you're using pyarrow, you have NativeFile.
As you increase the concurrent processes converting files and storing them into disk, couldn't we have issues regarding disk such as running out of space at some points or reaching throughput limit of the disk ?
Also true, but that's a limitation of any read/write from/to a filesystem, including databases. Disk space can be saved by ensuring that files are deleted once you're done with them. Also, you're more likely to reach your bandwidth limit before you reach your disk throughput limit, unless you're processing a lot of on-disk data or SQL statements.
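If you do end up going through disk, one way to make sure files are deleted once you're done is the standard-library tempfile module. A small sketch, where the function name and the payload are hypothetical:

```python
import os
import tempfile


def process_with_scratch_file(data: bytes):
    """Write data to a temporary file, use it, and let it be deleted on exit."""
    with tempfile.NamedTemporaryFile(suffix=".parquet") as tmp:
        tmp.write(data)
        tmp.flush()
        path = tmp.name
        size = os.path.getsize(path)  # stand-in for the real processing step
    # Leaving the with-block closes the file, which also removes it from disk
    return path, size


scratch_path, scratch_size = process_with_scratch_file(b"example payload")
```

With many concurrent workers this keeps disk usage bounded by what is being processed at that moment, rather than by everything processed so far.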
... but all the libraries I've seen so far, they always write to disk.
Unless the functions explicitly need a "filename", you can replace the file pointers (fp's) with a buffer object as mentioned above.
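To illustrate the idea with only the standard library: a function that takes a file pointer doesn't care whether it gets a real file or a buffer. `write_rows` below is a made-up example function, not something from a particular library.

```python
import csv
import io


def write_rows(fp, rows):
    """Write rows as CSV to any file-like object that has a write() method."""
    writer = csv.writer(fp)
    writer.writerows(rows)


# Pass an in-memory buffer where an open file would normally go
buffer = io.StringIO()
write_rows(buffer, [["id", "name"], [1, "a"], [2, "b"]])
csv_text = buffer.getvalue()
```

The same call works unchanged with `open("out.csv", "w", newline="")`, which is why swapping in a buffer is usually painless when the API accepts a file object rather than a path.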
Upvotes: 0