Reputation: 57
I need convert a CSV file to Parquet file in S3 path. I'm trying use the code below, but no error occurs, the code execute with success and dont convert the CSV file
import pandas as pd
import boto3
import pyarrow as pa
import pyarrow.parquet as pq
s3 = boto3.client("s3", region_name='us-east-2', aws_access_key_id='my key id',
aws_secret_access_key='my secret key')
obj = s3.get_object(Bucket='my bucket', Key='test.csv')
df = pd.read_csv(obj['Body'])
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table=table, root_path="test.parquet")
Upvotes: 1
Views: 9197
Reputation: 29
AWS CSV to Parquet Converter in Python
This Script gets files from Amazon S3 and converts it to Parquet Version for later query jobs and uploads it back to the Amazon S3.
import numpy
import pandas
import fastparquet
def lambda_handler(event,context):
#identifying resource
s3_object = boto3.client('s3', region_name='us-east-2')
#access file
get_file = s3_object.get_object(Bucket='ENTER_BUCKET_NAME_HERE', Key='CSV_FILE_NAME.csv')
get = get_file['Body']
df = pandas.DataFrame(get)
#convert csv to parquet function
def conv_csv_parquet_file(df):
converted_data_parquet = df.to_parquet('converted_data_parquet_version.parquet')
conv_csv_parquet_file(df)
print("File converted from CSV to parquet completed")
#uploading the parquet version file
s3_path = "/converted_to_parquet/" + converted_data_parquet
put_response = s3_resource.Object('ENTER_BUCKET_NAME_HERE',converted_data_parquet).put(Body=converted_data_parquet)
Python Library Boto3 allows the lambda to get the CSV file from S3 and then Fast-Parquet (or Pyarrow) converts the CSV file into Parquet.
From- https://github.com/ayshaysha/aws-csv-to-parquet-converter.py
Upvotes: 1