Reputation: 4377
I have two identical tables: one created by running a crawler on a .csv file, and the other an Iceberg table created with the following command:
CREATE TABLE dan_grafana.iced (
  meter string,
  readtime timestamp,
  kwh_total double)
PARTITIONED BY (`meter`, year(`readtime`))
LOCATION 's3://dev-aws/iceberg/iced'
TBLPROPERTIES (
  'table_type'='iceberg',
  'format'='parquet',
  'optimize_rewrite_delete_file_threshold'='10',
  'write_target_data_file_size_bytes'='134217728'
);
After creating the Iceberg table I copied the data from the .csv file into it; this was the only operation I performed on the Iceberg table. Reading from the Iceberg table takes twice as long as reading the plain .csv table, even though the cost is the same, and the number of bytes scanned is 5x higher for the Iceberg table.
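A copy like this amounts to a single INSERT INTO ... SELECT, sketched below with dan_grafana.meter_csv as a placeholder for the crawler-created table name:

-- dan_grafana.meter_csv is a placeholder for the crawler-created table
INSERT INTO dan_grafana.iced
SELECT meter, readtime, kwh_total
FROM dan_grafana.meter_csv;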
How can I improve the performance of the Iceberg table?
Upvotes: 0
Views: 1001
Reputation: 2210
As rightly suggested, this is not the best way to compare the performance of an Iceberg table with a CSV-based table; you need to increase the data volume. Also try creating a partitioned table, because that is where you will see the real difference, i.e. when filtering the data on the partition columns. You can also add a sort order to the Iceberg table and use those columns in your queries, which will further improve performance through predicate pushdown.
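A sketch of both suggestions against the table from the question (the meter id is hypothetical, and the sort-order statement uses Iceberg's Spark SQL extensions, since Athena does not expose it directly):

-- Filter on the partition columns so Iceberg can prune files;
-- year(readtime) partitioning lets a readtime range predicate prune too
SELECT meter, readtime, kwh_total
FROM dan_grafana.iced
WHERE meter = 'M-001'  -- hypothetical meter id
  AND readtime >= TIMESTAMP '2023-01-01 00:00:00'
  AND readtime <  TIMESTAMP '2024-01-01 00:00:00';

-- Set a sort order (run from Spark with Iceberg's SQL extensions;
-- adjust the catalog prefix, e.g. glue_catalog.dan_grafana.iced, to your setup)
ALTER TABLE dan_grafana.iced WRITE ORDERED BY meter, readtime;

Note that a sort order only applies to newly written files; existing data keeps its layout until it is rewritten.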
Upvotes: 0