binu456m

Reputation: 71

How to use GZIP to compress JSON data in a Python program?

I have an AWS Kinesis Python program - a producer that sends data to my stream. But my JSON file is 5 MB. I would like to compress the data using GZIP or any other suitable method. My producer code is like this:

import boto3
import json
import csv
from datetime import datetime
import calendar
import time
import random

# putting data to Kinesis
my_stream_name = 'ApacItTeamTstOrderStream'
kinesis_client = boto3.client('kinesis', region_name='us-east-1')

with open('output.json', 'r') as file:
    for line in file:
        put_response = kinesis_client.put_record(
            StreamName=my_stream_name,
            Data=line,
            PartitionKey=str(random.randrange(3000)))
        print(put_response)

My requirement is:

I need to compress this data and then push the compressed data to Kinesis. When we consume this data later, we need to decompress it.

Since I am very new to this, can someone guide me or suggest what I should add to the existing code?

Upvotes: 0

Views: 1927

Answers (1)

Nishit

Reputation: 1354

There are two ways in which you can compress the data:

1. Enable GZIP/Snappy compression on the Firehose stream - This can be done via the console itself

Firehose buffers the data and, once the buffering threshold is reached, takes all the buffered data and compresses it together to create the .gz object.
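If you prefer to set this up from code rather than the console, a minimal boto3 sketch looks roughly like the following (the delivery stream name and the ARNs are placeholders for your own Kinesis stream, S3 bucket and IAM role):

import boto3

firehose_client = boto3.client('firehose', region_name='us-east-1')

# Placeholder ARNs - replace with your own stream, bucket and role.
STREAM_ARN = 'arn:aws:kinesis:us-east-1:123456789012:stream/ApacItTeamTstOrderStream'
BUCKET_ARN = 'arn:aws:s3:::my-firehose-output-bucket'
ROLE_ARN = 'arn:aws:iam::123456789012:role/firehose-delivery-role'

# Delivery stream that reads from the Kinesis stream and writes
# GZIP-compressed objects to S3 once the buffering threshold is hit.
firehose_client.create_delivery_stream(
    DeliveryStreamName='ApacItTeamTstOrderFirehose',
    DeliveryStreamType='KinesisStreamAsSource',
    KinesisStreamSourceConfiguration={
        'KinesisStreamARN': STREAM_ARN,
        'RoleARN': ROLE_ARN,
    },
    ExtendedS3DestinationConfiguration={
        'RoleARN': ROLE_ARN,
        'BucketARN': BUCKET_ARN,
        'CompressionFormat': 'GZIP',
        'BufferingHints': {'SizeInMBs': 5, 'IntervalInSeconds': 300},
    },
)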

Pros :

  • Minimal effort required on the producer side - just change the setting in the console.
  • Minimal effort required on the consumer side - Firehose creates .gz objects in S3 and sets metadata on the objects to reflect the compression type. Hence, if you read the data via the AWS SDK itself, the SDK will do the decompression for you.

Cons :

  • Since Firehose charges based on the size of the data ingested, you will not save on Firehose cost. You will only save on S3 cost (due to the smaller object size).

2. Compression by producer code - You need to write the code yourself

I implemented this in Java a few days back. We were ingesting over 100 petabytes of data into Firehose (from where it gets written to S3), which was a massive cost for us.

So, we decided to do the compression on the producer side. This results in compressed data flowing to Kinesis Firehose, which writes it to S3 as is. Note that since Firehose is not doing the compression, it has no idea what the data is; as a result, the objects created in S3 do not carry the ".gz" extension or compression metadata, so consumers cannot tell from the object alone what is inside. We therefore wrote a wrapper on top of the AWS Java SDK for S3 which reads the object and decompresses it.
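A rough Python sketch of the producer-side change, adapted from the code in the question (my own implementation was in Java, so treat this as an illustration rather than tested code):

import boto3
import gzip
import random

my_stream_name = 'ApacItTeamTstOrderStream'
kinesis_client = boto3.client('kinesis', region_name='us-east-1')

with open('output.json', 'r') as file:
    for line in file:
        # gzip-compress each record before sending; put_record accepts raw bytes
        compressed = gzip.compress(line.encode('utf-8'))
        put_response = kinesis_client.put_record(
            StreamName=my_stream_name,
            Data=compressed,
            PartitionKey=str(random.randrange(3000)))
        print(put_response)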

Pros :

  • Our compression factor was close to 90%, which directly resulted in a 90% saving on Firehose cost, plus the additional S3 savings as in approach 1.

Cons :

  • Not exactly a con, but more development effort is required: creating a wrapper on top of the AWS SDK, testing, etc. (a rough sketch of such a wrapper follows this list).
  • Compression and decompression are CPU intensive. On average, the two together increased our CPU utilization by 22%.
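For reference, a Python analogue of such a wrapper might look roughly like this (our wrapper was in Java; the bucket and key below are placeholders for whatever your Firehose stream writes):

import boto3
import gzip

s3_client = boto3.client('s3')

def get_decompressed_object(bucket, key):
    # Read the raw (compressed) bytes that Firehose wrote to S3 ...
    body = s3_client.get_object(Bucket=bucket, Key=key)['Body'].read()
    # ... and decompress explicitly, since the object carries no .gz metadata.
    return gzip.decompress(body).decode('utf-8')

# Hypothetical usage - replace with the bucket/prefix your delivery stream writes to.
print(get_decompressed_object('my-firehose-output-bucket', 'orders/2021/01/01/records'))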

Upvotes: 3
