Reputation: 97
I've been working on a project where I store IoT data in an S3 bucket, batching it with AWS Kinesis Data Firehose. I have a Lambda function running on the delivery stream that converts the epoch-milliseconds time field into a proper timestamp with date and time. Here is my sample JSON payload:
{
  "device_name": "inHand-RTU",
  "Temperature": 22.3,
  "Pyranometer": 6,
  "Active-Power": 0,
  "Voltage-1": 233.93,
  "Active-Import": 2.57,
  "time": "17-01-2023 10:49:09"
}
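For reference, the gist of my transformation Lambda is something like this (a simplified sketch; it assumes each record is a single JSON object whose time field arrives as epoch milliseconds):

import base64
import json
from datetime import datetime, timezone

def lambda_handler(event, context):
    # Firehose data-transformation contract: decode the base64 records,
    # return each with recordId, result, and re-encoded data.
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Convert epoch milliseconds to a "dd-mm-YYYY HH:MM:SS" string
        payload["time"] = datetime.fromtimestamp(
            payload["time"] / 1000, tz=timezone.utc
        ).strftime("%d-%m-%Y %H:%M:%S")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}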
I now want to convert these files in S3 to Parquet and then process them with Apache PySpark. What is the best way to do so? Should I use Kinesis Firehose itself, which provides functionality to convert the data into Parquet format, or should I go with AWS Glue jobs? Both services seem to do the same thing. What is the difference between the two? Which approach should I follow?
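For context, my understanding is that the Glue-job route would look roughly like the standard generated Glue script below (just a sketch; the bucket paths are placeholders):

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw JSON files Firehose delivered to S3...
frame = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-iot-bucket/raw/"]},
    format="json",
)

# ...and rewrite them as Parquet under a separate prefix
glue_context.write_dynamic_frame.from_options(
    frame=frame,
    connection_type="s3",
    connection_options={"path": "s3://my-iot-bucket/parquet/"},
    format="parquet",
)

job.commit()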
Any help will be greatly appreciated.
Upvotes: 1
Views: 2302
Reputation: 6333
The best way is to use the native Parquet conversion built into Firehose.
Firehose has an option (Convert record format; enable it) to convert records to Parquet or ORC format before delivering them to S3. Note that this feature requires a table schema defined in the AWS Glue Data Catalog, which Firehose uses to deserialize your JSON and write Parquet. That is also the practical difference from a Glue job: Firehose converts records inline as they stream through the delivery stream, while a Glue job is a separate batch ETL run over files already sitting in S3.
https://docs.aws.amazon.com/firehose/latest/dev/create-transform.html
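Once Firehose is delivering Parquet to S3, reading it back in PySpark is straightforward. A minimal sketch (the bucket and prefix are placeholders, and the s3a connector must be on the Spark classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iot-parquet").getOrCreate()

# Read every Parquet file under the Firehose delivery prefix
df = spark.read.parquet("s3a://my-iot-bucket/firehose-output/")
df.printSchema()

# Example aggregation over the fields from the sample payload
df.groupBy("device_name").avg("Temperature").show()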
Upvotes: 4