Lino Costa

Reputation: 57

Convert CSV to Parquet in S3 with Python

I need to convert a CSV file to a Parquet file at an S3 path. I'm trying the code below; no error occurs and the code executes successfully, but the CSV file is not converted.

import pandas as pd
import boto3
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3", region_name='us-east-2', aws_access_key_id='my key id',
                  aws_secret_access_key='my secret key')

obj = s3.get_object(Bucket='my bucket', Key='test.csv')
df = pd.read_csv(obj['Body'])
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table=table, root_path="test.parquet")

Upvotes: 1

Views: 9197

Answers (1)

Learn2Skills

Reputation: 29

AWS CSV to Parquet Converter in Python

This script gets a file from Amazon S3, converts it to Parquet for later query jobs, and uploads it back to Amazon S3.

import io

import boto3
import pandas  # to_parquet needs pyarrow or fastparquet installed

def lambda_handler(event, context):

    # identify the resource
    s3_object = boto3.client('s3', region_name='us-east-2')

    # access the file
    get_file = s3_object.get_object(Bucket='ENTER_BUCKET_NAME_HERE', Key='CSV_FILE_NAME.csv')

    # parse the CSV body into a DataFrame
    df = pandas.read_csv(get_file['Body'])

    # convert the DataFrame to Parquet bytes in memory
    def conv_csv_parquet_file(df):
        buffer = io.BytesIO()
        df.to_parquet(buffer, index=False)
        return buffer.getvalue()

    converted_data_parquet = conv_csv_parquet_file(df)

    print("File converted from CSV to parquet completed")

    # upload the Parquet version of the file
    s3_path = "converted_to_parquet/converted_data_parquet_version.parquet"

    put_response = s3_object.put_object(Bucket='ENTER_BUCKET_NAME_HERE', Key=s3_path, Body=converted_data_parquet)

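The Boto3 library lets the Lambda function fetch the CSV file from S3, and fastparquet (or PyArrow) then converts it to Parquet. The converted file has to be uploaded back to S3 explicitly; writing it to a local path alone leaves the bucket unchanged.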

Source: https://github.com/ayshaysha/aws-csv-to-parquet-converter.py

Upvotes: 1
