user1578872

Reputation: 9018

Kafka data compression technique

I loaded data (selected tables and data only) from Oracle to Kafka with a replication factor of 1 (so only one copy), and the data size in Kafka is 1 TB. Kafka stores the data in a compressed format, but I want to know the actual data size in Oracle. Since we loaded only selected tables and data, I am not able to check the actual data size in Oracle directly. Is there any formula I can apply to estimate the Oracle data size corresponding to this 1 TB loaded into Kafka?

Kafka version - 2.1

Also, it took 4 hours to move the data from Oracle to Kafka. The data size over the wire could be different from the size at rest. How can I estimate the amount of data sent over the wire and the bandwidth consumed?

Upvotes: 1

Views: 337

Answers (1)

LSerni

Reputation: 57378

There is as yet insufficient data for a meaningful answer.

Kafka supports GZip, LZ4 and Snappy compression (and, as of version 2.1, zstd as well), with different compression factors and different saturation thresholds. All of these methods are "learning based", i.e. they consume bytes from a stream, build a dictionary, and output symbols from that dictionary. As a result, short data streams will not compress very well, because the dictionary hasn't learned much yet. And if the characteristics of the incoming bytes drift away from what the dictionary has captured, the compression ratio goes down again.

This means that the structure of the data can completely change the compression performance.
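
To make that concrete, here is a minimal sketch using Python's standard-library gzip (not Kafka's own codec implementations, but the behaviour is analogous); the payloads are synthetic and purely illustrative:

    import gzip
    import os

    # Same volume of data, very different structure: high-entropy random bytes
    # vs. repetitive, structured records.
    size = 1_000_000  # ~1 MB of payload in both cases

    random_payload = os.urandom(size)                        # essentially incompressible
    repetitive_payload = b"id=42;status=OK;" * (size // 16)  # highly redundant

    for label, payload in (("random", random_payload), ("repetitive", repetitive_payload)):
        compressed = gzip.compress(payload)
        ratio = len(payload) / len(compressed)
        print(f"{label:10s}: {len(payload)} -> {len(compressed)} bytes (ratio {ratio:.1f}x)")

    # Typical result: the random payload barely shrinks (ratio around 1.0x),
    # while the repetitive one compresses by two orders of magnitude.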

On the whole, in real-world applications with reasonable data (i.e. not a DTM sparse matrix, nor a storage system for PDF or Office documents) you can expect a compression ratio between 1.2x and 2.0x on average. The larger the data chunks, the higher the compression. The actual content of the "message" also carries great weight, as you can imagine.
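
The chunk-size effect can also be sketched with standard-library gzip (again an illustrative assumption, not Kafka's batching code): compressing many small records one by one gives a far worse ratio than compressing them as one large batch, which is roughly what larger producer batches buy you.

    import gzip

    # Hypothetical small JSON-ish records, compressed individually vs. batched.
    record = b'{"id": 12345, "status": "SHIPPED", "warehouse": "EU-WEST-1"}\n'
    records = [record.replace(b"12345", str(i).encode()) for i in range(10_000)]
    raw_total = sum(len(r) for r in records)

    per_record_total = sum(len(gzip.compress(r)) for r in records)  # one tiny stream each
    batched_total = len(gzip.compress(b"".join(records)))           # one large stream

    print(f"raw     : {raw_total} bytes")
    print(f"per-rec : {per_record_total} bytes ({raw_total / per_record_total:.1f}x)")
    print(f"batched : {batched_total} bytes ({raw_total / batched_total:.1f}x)")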

Oracle, for its part, allocates data in data blocks, which means you get some slack-space overhead; then again, it can compress those blocks, and in some instances it also performs deduplication.

Therefore, a meaningful and reasonably precise answer would have to depend on several factors that we don't know here.

As a ballpark figure, I'd say that the actual "logical" data behind the 1 TB in Kafka ought to range between 0.7 and 2 TB, and I'd expect the Oracle occupation to be anywhere from 0.9 to 1.2 TB if compression is available on the Oracle side, and from 1.2 TB to 2.4 TB if it is not.
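
Expressed as a trivial back-of-the-envelope calculation (the factors below simply encode the assumed ranges above; none of them are measured):

    kafka_tb = 1.0

    # Assumed expansion factor of Kafka's compression (compressed -> logical data).
    logical_range_tb = (kafka_tb * 0.7, kafka_tb * 2.0)

    # Assumed Oracle occupation for that logical data.
    oracle_compressed_tb = (0.9, 1.2)
    oracle_uncompressed_tb = (1.2, 2.4)

    print(f"logical data        : {logical_range_tb[0]:.1f} - {logical_range_tb[1]:.1f} TB")
    print(f"Oracle, compressed  : {oracle_compressed_tb[0]:.1f} - {oracle_compressed_tb[1]:.1f} TB")
    print(f"Oracle, uncompressed: {oracle_uncompressed_tb[0]:.1f} - {oracle_uncompressed_tb[1]:.1f} TB")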

But this is totally a shot in the dark. You could have compressed binary information stored (say, XLSX or JPEG-2000 files, or MP3 songs), and that would actually grow slightly when run through a compressor. Or you might have swaths of sparse matrix data that compress 20:1 or better even with the most cursory gzipping. In the first case, the 1 TB would remain more or less 1 TB once decompressed; in the second case, the same 1 TB could just as easily grow to 20 TB or more.

I am afraid the simplest way to know is to instrument both storage systems and the network, and directly monitor traffic and data usage.

Once you know the parameters of your databases, you can extrapolate them to different data volumes (so, say, if you know that 1 TB in Kafka requires 2.5 TB of network traffic and becomes 2.1 TB of Oracle tablespace, it stands to reason that 2 TB in Kafka would require 5 TB of traffic and occupy 4.2 TB on the Oracle side; see the sketch below)... but even then, only provided the nature of the data does not change in the interim.
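
A minimal sketch of that extrapolation, plus the bandwidth figure asked about in the question; the "measured" values are the hypothetical numbers from the parentheses above, and the 4-hour transfer time comes from the question:

    # Hypothetical measurements you would obtain from monitoring.
    measured_kafka_tb = 1.0
    measured_wire_tb = 2.5      # from network monitoring
    measured_oracle_tb = 2.1    # from tablespace usage
    transfer_hours = 4.0        # from the original question

    wire_per_kafka = measured_wire_tb / measured_kafka_tb
    oracle_per_kafka = measured_oracle_tb / measured_kafka_tb

    def extrapolate(kafka_tb):
        """Linear extrapolation; only valid while the nature of the data is stable."""
        return kafka_tb * wire_per_kafka, kafka_tb * oracle_per_kafka

    wire_tb, oracle_tb = extrapolate(2.0)
    print(f"2.0 TB in Kafka -> ~{wire_tb:.1f} TB on the wire, ~{oracle_tb:.1f} TB in Oracle")

    # Average bandwidth consumed during the original 4-hour load (decimal units).
    avg_gbit_s = measured_wire_tb * 8 * 1000 / (transfer_hours * 3600)
    print(f"Average throughput: ~{avg_gbit_s:.2f} Gbit/s")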

Upvotes: 1
