Kaspar

Reputation: 33

Azure Data Explorer storage cost estimate calculation

I am new to ADX and am currently investigating the possibility of ingesting Parquet files from Blob Storage into ADX for further processing. I am trying to come up with a monthly cost estimate using this template:

https://azure.microsoft.com/en-us/pricing/calculator/?service=azure-data-explorer

I have difficulty understanding how to calculate the "Estimated data compression" in the ADX cost estimator, or how ADX compresses data in general.


I have seen in various unofficial sources that the default data compression setting is 7 (I could not find any official documentation on how to estimate the data compression ratio in ADX). For example:

Azure Data Explorer Cost Estimator giving implausible estimates

https://youtu.be/ndyPzbAi_kY?si=uSqB1M05_sOH7n6j&t=227

However, when I ingested a Parquet file from Blob Storage into ADX (both dev and prd yielded the same result), I realised that the size of the table into which the Parquet file was ingested had actually increased compared to the original file, instead of being compressed.

Blob Storage: 27 MB Parquet file.

Based on the default data compression ratio of 7, I expected the table size after ingestion to be around 3-4 MB (27 MB / 7).

In ADX (dev: Dev(No SLA)_Standard_D11_v2, 1 instance; prd: Standard_L16as_v3, 2 instances), however, the size of the table is actually 74 MB (TotalExtentSize: 74 MB, TotalOriginalSize: 874 MB), which is roughly 3 times the size of the original Parquet file.
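
For reference, these figures can be read from the table details with something along these lines (a minimal sketch; MyTable is a placeholder for the actual table name):

    .show table MyTable details
    | project TotalExtentSize,        // 74 MB in my case
              TotalOriginalSize,      // 874 MB in my case
              CompressionRatio = todouble(TotalOriginalSize) / TotalExtentSize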

Does the estimated data compression refer to TotalOriginalSize / TotalExtentSize?

How can TotalOriginalSize be 874 MB when the original Parquet file is only 27 MB?

I would like to know how to provide the correct estimated data compression specifically for Parquet files, and whether there are any ways to further compress the table during ingestion.

Upvotes: 0

Views: 614

Answers (2)

SumanthMarigowda-MSFT

Reputation: 2336

The estimated data compression would be 874 MB / 74 MB ≈ 11.8. It's certainly possible that Parquet compresses data to smaller sizes than Kusto: the latter includes text indexes, which Parquet lacks, and emphasizes fast ingestion, both of which increase the resulting footprint.
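
As a quick sanity check, the same ratio can be computed directly in KQL (the figures are the ones reported in the question):

    print OriginalMB = 874.0, ExtentMB = 74.0
    | extend EstimatedDataCompression = OriginalMB / ExtentMB   // ≈ 11.8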

Upvotes: 0

Avner Aharoni

Reputation: 1

The original size is the normalized size of the data as conceptually uncompressed CSV, and it is calculated once the data is ingested. This is done in order to provide a number that is comparable across all data formats and compression strategies. Thus, if you convert the Parquet file to uncompressed CSV, its size will be much closer to the OriginalSize number that Kusto provides.

You are correct that the compression ratio is usually calculated as the ratio between the original size and the extent size. However, as long as dividing the original size by the compression ratio entered in the cost calculator returns the extent size, the calculation is correct, because the main components of the cost are calculated based on the extent size. That said, it is recommended to use the original size that Kusto provides, so that comparisons to other pipelines, data formats, and compression schemes are apples to apples.
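
To make that concrete, here is a small sketch with the numbers from the question, showing that dividing the original size by the ratio lands back on the extent size, which is what drives the storage cost:

    print OriginalMB = 874.0, ExtentMB = 74.0
    | extend CompressionRatio = OriginalMB / ExtentMB           // ≈ 11.8, the value to enter in the calculator
    | extend ImpliedExtentMB  = OriginalMB / CompressionRatio   // reproduces the 74 MB extent size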

As for why Parquet compression is better than Kusto's in this case, it is likely because of the extensive string and dynamic indexing in Kusto, which allows it to provide much better performance than querying Parquet directly.

As a side note, a more convenient cost estimator can be found here: http://aka.ms/adx.costestimator

Upvotes: 0
