Convert CSV to Parquet in S3 with Python

I need to convert a CSV file to a Parquet file at an S3 path. I'm trying the code below; it runs successfully with no errors, but the CSV file is not converted.

import pandas as pd
import boto3
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3", region_name='us-east-2', aws_access_key_id='my key id',
                  aws_secret_access_key='my secret key')

obj = s3.get_object(Bucket='my bucket', Key='test.csv')
df = pd.read_csv(obj['Body'])
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table=table, root_path="test.parquet")

Upvotes: 1

Answers (1)

Learn2Skills

Reputation: 27

AWS CSV to Parquet Converter in Python

This script gets a CSV file from Amazon S3, converts it to Parquet for later query jobs, and uploads the result back to Amazon S3.

import io

import boto3
import pandas  # DataFrame.to_parquet needs pyarrow or fastparquet installed

def lambda_handler(event, context):

    # identifying resource
    s3_object = boto3.client('s3', region_name='us-east-2')

    # access file
    get_file = s3_object.get_object(Bucket='ENTER_BUCKET_NAME_HERE', Key='CSV_FILE_NAME.csv')

    get = get_file['Body']

    # parse the CSV bytes into a DataFrame (pandas.DataFrame(get) would not parse CSV)
    df = pandas.read_csv(io.BytesIO(get.read()))

    # convert csv to parquet function
    def conv_csv_parquet_file(df):
        # /tmp is the writable directory inside a Lambda function
        local_path = '/tmp/converted_data_parquet_version.parquet'
        df.to_parquet(local_path)
        return local_path

    converted_data_parquet = conv_csv_parquet_file(df)

    print("CSV to Parquet conversion completed")

    # uploading the parquet version file
    s3_path = "converted_to_parquet/converted_data_parquet_version.parquet"

    s3_object.upload_file(converted_data_parquet, 'ENTER_BUCKET_NAME_HERE', s3_path)

The Boto3 library lets the Lambda function get the CSV file from S3, and pandas then writes it out as Parquet using fastparquet (or PyArrow) as the engine.

Source: https://github.com/ayshaysha/aws-csv-to-parquet-converter.py

Upvotes: 1
