JaviOverflow

Reputation: 1480

Pandas dataframe to parquet buffer in memory

The use case is the following:

  1. Read data from external database and load it into pandas dataframe
  2. Transform that dataframe into parquet format buffer
  3. Upload that buffer to s3

I've been trying to do step two in-memory (without having to store the file to disk in order to get the parquet format), but every library I've seen so far writes to disk.

So I have the following questions:

  1. Wouldn't it be more performant if the conversion was done in-memory, since you don't have to deal with disk I/O overhead?
  2. As you increase the number of concurrent processes converting files and storing them to disk, couldn't we have disk issues, such as running out of space at some point or reaching the throughput limit of the disk?

Upvotes: 13

Views: 12983

Answers (2)

JD D

Reputation: 8137

Apache Arrow and the pyarrow library should solve this; they do much of the processing in memory. In pandas you can read/write parquet files via pyarrow.

Here is some example code that also leverages smart_open:

import pandas as pd
import boto3
from smart_open import open
from io import BytesIO

s3 = boto3.client('s3')

# read a parquet file from S3 into memory
# (bucket and key are placeholders for the source object)
obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_parquet(BytesIO(obj['Body'].read()), engine='pyarrow')

# do stuff with dataframe

# write the dataframe back to S3 as parquet, straight from memory
# (outputBucket, outputPrefix and additionalSuffix are placeholders for the target location)
with open(f's3://{outputBucket}/{outputPrefix}{additionalSuffix}', 'wb') as out_file:
    df.to_parquet(out_file, engine='pyarrow', index=False)
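
Because smart_open opens the s3:// URL as a writable file object and streams the upload to S3, the parquet data goes from the dataframe to S3 without ever being written to local disk.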

Upvotes: 12

aneroid

Reputation: 16007

Wouldn't it be more performant if the conversion was done in-memory, since you don't have to deal with disk I/O overhead?

Yes, it would. And for that you could use a BytesIO object (or a StringIO for text), which can be used anywhere a file object is expected. If you're using pyarrow, you also have NativeFile.
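
For example, here is a minimal sketch of steps 2 and 3 using a BytesIO buffer and boto3; the sample dataframe, bucket name and key are hypothetical placeholders:

import boto3
import pandas as pd
from io import BytesIO

# stand-in for the dataframe loaded from the external database
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# step 2: serialize the dataframe to parquet entirely in memory
buffer = BytesIO()
df.to_parquet(buffer, engine='pyarrow', index=False)

# step 3: upload the buffer contents to S3 without writing a local file
s3 = boto3.client('s3')
s3.put_object(Bucket='my-bucket', Key='data/output.parquet', Body=buffer.getvalue())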

As you increase the number of concurrent processes converting files and storing them to disk, couldn't we have disk issues, such as running out of space at some point or reaching the throughput limit of the disk?

Also true, but that's a limitation on any read/write from/to a filesystem, including databases. Disk space can be saved by ensuring that files are deleted once you're done with them. Also, you're more likely to reach your bandwidth limit before you reach your disk throughput limit, unless you're processing a lot of on-disk data or SQL statements.

... but every library I've seen so far writes to disk.

Unless a function explicitly needs a "filename", you can replace the file pointer (fp) with a buffer object, as mentioned above.
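
For instance, a rough sketch of the pyarrow route, writing the table to an in-memory BufferOutputStream (a NativeFile) instead of a file on disk; the sample dataframe, bucket name and key are again hypothetical placeholders:

import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# stand-in for the dataframe loaded from the external database
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# convert the dataframe to an Arrow table and write it to an in-memory stream
table = pa.Table.from_pandas(df)
buf_stream = pa.BufferOutputStream()
pq.write_table(table, buf_stream)

# pull out the raw parquet bytes and upload them to S3
parquet_bytes = buf_stream.getvalue().to_pybytes()
boto3.client('s3').put_object(Bucket='my-bucket', Key='data/output.parquet', Body=parquet_bytes)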

Upvotes: 0
