Reputation: 71
I have an AWS Kinesis Python producer program that sends data to my stream, but my JSON file is 5 MB. I would like to compress the data using GZIP or another suitable method. My producer code looks like this:
import boto3
import json
import csv
from datetime import datetime
import calendar
import time
import random
# putting data to Kinesis
my_stream_name='ApacItTeamTstOrderStream'
kinesis_client=boto3.client('kinesis',region_name='us-east-1')
with open('output.json', 'r') as file:
    for line in file:
        put_response = kinesis_client.put_record(
            StreamName=my_stream_name,
            Data=line,
            PartitionKey=str(random.randrange(3000)))
        print(put_response)
My requirement is: compress the data, push the compressed data to Kinesis, and then decompress it again when consuming it.
Since I am very new to this, can someone guide me or suggest what I should add to the existing code?
Upvotes: 0
Views: 1927
Reputation: 1354
There are 2 ways in which you can compress the data:
1. Enable GZIP/Snappy compression on the Firehose stream - This can be done via the console itself (a boto3 sketch of the equivalent configuration follows the pros/cons below)
Firehose buffers the data and, once the buffering threshold is reached, it takes all the buffered data and compresses it together to create the .gz object.
Pros :
Cons :
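If you prefer to set this up programmatically rather than through the console, the same option can be configured with boto3 when the delivery stream is created. This is only a minimal sketch assuming an S3 destination; the delivery stream name, role ARN, and bucket ARN below are placeholders.

import boto3

firehose_client = boto3.client('firehose', region_name='us-east-1')

# Create a delivery stream that GZIP-compresses each buffered batch before
# writing it to S3. All names/ARNs here are placeholders.
firehose_client.create_delivery_stream(
    DeliveryStreamName='ApacItTeamTstOrderDeliveryStream',
    DeliveryStreamType='DirectPut',
    ExtendedS3DestinationConfiguration={
        'RoleARN': 'arn:aws:iam::123456789012:role/firehose-delivery-role',
        'BucketARN': 'arn:aws:s3:::my-destination-bucket',
        'CompressionFormat': 'GZIP',      # 'Snappy' is also supported
        'BufferingHints': {
            'SizeInMBs': 5,               # flush after 5 MB...
            'IntervalInSeconds': 300      # ...or after 5 minutes, whichever comes first
        }
    }
)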
2. Compression by Producer code - Need to write the code
I implemented this in Java a few days back. We were ingesting over 100 petabytes of data into Firehose (from where it gets written to S3), which was a massive cost for us.
So we decided to do the compression on the producer side. This means compressed data flows into Kinesis Firehose and is written to S3 as-is. Note that since Firehose is not the one compressing it, it has no idea what the data is; as a result, the objects created in S3 don't carry a ".gz" extension, and consumers are none the wiser about what is in them. We then wrote a wrapper on top of the AWS Java SDK for S3 that reads an object and decompresses it. A Python sketch of the same idea, applied to the question's producer code, follows the pros/cons below.
Pros :
Cons :
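To give a rough idea of the producer-side approach in Python (the question's language and Kinesis stream rather than Java and Firehose), a gzip-based sketch could look like the following; the stream name is taken from the question and the shard id on the consumer side is a placeholder.

import gzip
import random
import boto3

my_stream_name = 'ApacItTeamTstOrderStream'
kinesis_client = boto3.client('kinesis', region_name='us-east-1')

# Producer: gzip each JSON line before putting it on the stream.
with open('output.json', 'r') as file:
    for line in file:
        compressed = gzip.compress(line.encode('utf-8'))
        put_response = kinesis_client.put_record(
            StreamName=my_stream_name,
            Data=compressed,
            PartitionKey=str(random.randrange(3000)))
        print(put_response)

# Consumer: read records back and decompress each payload to the original line.
shard_iterator = kinesis_client.get_shard_iterator(
    StreamName=my_stream_name,
    ShardId='shardId-000000000000',          # placeholder shard id
    ShardIteratorType='TRIM_HORIZON')['ShardIterator']

response = kinesis_client.get_records(ShardIterator=shard_iterator, Limit=100)
for record in response['Records']:
    original_line = gzip.decompress(record['Data']).decode('utf-8')
    print(original_line)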
Upvotes: 3