user4038636

Reputation:

Read/write parquet files with AWS Lambda?

Hi, I need a Lambda function that reads and writes Parquet files and saves them to S3. I tried to make a deployment package with the libraries I need to use pyarrow, but I am getting an initialization error for the cffi library:

module initialization error: [Errno 2] No such file or directory: '/var/task/__pycache__/_cffi__x762f05ffx6bf5342b.c'

Can I even make Parquet files with AWS Lambda? Has anyone had a similar problem?

I would like to do something like this:

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

df = pd.DataFrame([data])  # data is a dictionary
table = pa.Table.from_pandas(df)
# /tmp is the only writable directory in the Lambda environment
pq.write_table(table, '/tmp/test.parquet', compression='snappy')
table = pq.read_table('/tmp/test.parquet')
table.to_pandas()
print(table)

Or by some other method; I just need to be able to read and write Parquet files compressed with Snappy.

Upvotes: 4

Views: 14621

Answers (3)

leeprevost

Reputation: 444

I believe the modern version of this answer is to use an AWS Data Wrangler (awswrangler) Lambda layer, which ships pandas and wr.s3.to_parquet natively. I use this and it works like a champ!

Tutorial on Parquet datasets: https://aws-data-wrangler.readthedocs.io/en/stable/tutorials/004%20-%20Parquet%20Datasets.html

Installing as a layer: https://aws-data-wrangler.readthedocs.io/en/stable/install.html
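
A minimal sketch of what that looks like, assuming the awswrangler layer is attached to the function and "my-bucket" is a placeholder bucket name:

import awswrangler as wr
import pandas as pd

def lambda_handler(event, context):
    df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})

    # Write the DataFrame to S3 as Snappy-compressed Parquet.
    wr.s3.to_parquet(df=df, path="s3://my-bucket/data/test.parquet", compression="snappy")

    # Read it back into a DataFrame.
    df_back = wr.s3.read_parquet("s3://my-bucket/data/test.parquet")
    return {"rows": len(df_back)}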

Upvotes: 0

bluu

Reputation: 552

For the inclusion of the dependencies needed for Snappy compression/decompression, please see Paul Zielinski's answer.

Regarding writing to (and reading from) S3 itself, you also need to use s3fs (and package it in the zip), adding the following to your code:

import s3fs

# s3fs exposes S3 as a file-system-like object that pyarrow can write to and read from
s3 = s3fs.S3FileSystem()

with s3.open('s3://your-bucket/path/to/test.parquet', 'wb') as f:
    pq.write_table(table, f)

with s3.open('s3://your-bucket/path/to/test.parquet', 'rb') as f:
    table = pq.read_table(f)

A note on your usage of table.to_pandas(): I don't believe this method works in place on the table, so if you don't assign the result (df = table.to_pandas()), the call is useless.

Finally, you could also use the following for reading a complete (partitioned) dataset from S3 directly:

dataset = pq.ParquetDataset(
    'your-bucket/path/to/your/dataset',
    filesystem=s3)
table = dataset.read()

with path/to/your/dataset being the path to the directory containing your dataset.

Thanks to Wes McKinney and DrChrisLevy (GitHub) for this last solution, provided in ARROW-1213!

Upvotes: 0

Paul Zielinski

Reputation: 111

I believe this is an issue with the Snappy shared object file missing from the package deployed to Lambda.

https://github.com/andrix/python-snappy/issues/52#issuecomment-342364113

I got the same error when trying to encode with Snappy from a Lambda function (which is invoked from a directory to which it does not have write permissions); including libsnappy.so.1 in my zip file resolved it.
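
As a quick sanity check (a sketch, not part of the original answer): once libsnappy.so.1 is bundled in the deployment package, a Snappy round trip through /tmp (the only writable path in Lambda) should succeed without the initialization error:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def lambda_handler(event, context):
    df = pd.DataFrame({"x": [1, 2, 3]})
    table = pa.Table.from_pandas(df)
    # Write and read a Snappy-compressed file under /tmp to confirm the codec loads.
    pq.write_table(table, "/tmp/snappy_check.parquet", compression="snappy")
    return pq.read_table("/tmp/snappy_check.parquet").num_rows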

Upvotes: 3
