JaviOverflow

Reputation: 1480

Pandas dataframe to parquet buffer in memory

The use case is the following:

  1. Read data from external database and load it into pandas dataframe
  2. Transform that dataframe into parquet format buffer
  3. Upload that buffer to s3

I've been trying to do step two in-memory (without having to store the file to disk in order to get the parquet format), but every library I've seen so far writes to disk.

So I have the following questions:

  1. Wouldn't it be more performant if the conversion was done in-memory, since you don't have to deal with disk I/O overhead?
  2. As you increase the number of concurrent processes converting files and storing them to disk, couldn't we have disk issues, such as running out of space at some point or reaching the throughput limit of the disk?

Upvotes: 13

Views: 12983

Answers (2)

JD D

Reputation: 8137

Apache Arrow and the pyarrow library should solve this; they do much of the processing in memory. In pandas you can read/write parquet files via pyarrow.

Here is some example code that also leverages smart_open:

import pandas as pd
import boto3
from smart_open import open
from io import BytesIO

s3 = boto3.client('s3')

# read a parquet file from S3 into memory
# (bucket and key are placeholders for the source object)
obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_parquet(BytesIO(obj['Body'].read()), engine='pyarrow')

# do stuff with dataframe

# write the dataframe back to S3 as parquet, straight from memory
# (outputBucket, outputPrefix and additionalSuffix are placeholders for the target location)
with open(f's3://{outputBucket}/{outputPrefix}{additionalSuffix}', 'wb') as out_file:
    df.to_parquet(out_file, engine='pyarrow', index=False)
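
Because smart_open opens the s3:// URL as a writable file object and streams the upload to S3, the parquet data goes from the dataframe to S3 without ever being written to local disk.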

Upvotes: 12

aneroid

Reputation: 16007

Wouldn't it be more performant if the conversion was done in-memory, since you don't have to deal with disk I/O overhead?

Yes, it would. And for that you could use a BytesIO object (or a StringIO for text), which can be used anywhere a file object is expected. If you're using pyarrow, you also have NativeFile.
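
For example, here is a minimal sketch of steps 2 and 3 using a BytesIO buffer and boto3; the sample dataframe, bucket name and key are hypothetical placeholders:

import boto3
import pandas as pd
from io import BytesIO

# stand-in for the dataframe loaded from the external database
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# step 2: serialize the dataframe to parquet entirely in memory
buffer = BytesIO()
df.to_parquet(buffer, engine='pyarrow', index=False)

# step 3: upload the buffer contents to S3 without writing a local file
s3 = boto3.client('s3')
s3.put_object(Bucket='my-bucket', Key='data/output.parquet', Body=buffer.getvalue())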

As you increase the number of concurrent processes converting files and storing them to disk, couldn't we have disk issues, such as running out of space at some point or reaching the throughput limit of the disk?

Also true, but that's a limitation on any read/write from/to a filesystem, including databases. Disk space can be saved by ensuring that files are deleted once you're done with them. Also, you're more likely to reach your bandwidth limit before you reach your disk throughput limit, unless you're processing a lot of on-disk data or SQL statements.

... but every library I've seen so far writes to disk.

Unless a function explicitly needs a "filename", you can replace the file pointer (fp) with a buffer object, as mentioned above.
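
For instance, a rough sketch of the pyarrow route, writing the table to an in-memory BufferOutputStream (a NativeFile) instead of a file on disk; the sample dataframe, bucket name and key are again hypothetical placeholders:

import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# stand-in for the dataframe loaded from the external database
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# convert the dataframe to an Arrow table and write it to an in-memory stream
table = pa.Table.from_pandas(df)
buf_stream = pa.BufferOutputStream()
pq.write_table(table, buf_stream)

# pull out the raw parquet bytes and upload them to S3
parquet_bytes = buf_stream.getvalue().to_pybytes()
boto3.client('s3').put_object(Bucket='my-bucket', Key='data/output.parquet', Body=parquet_bytes)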

Upvotes: 0
