Reputation: 195
I'm trying to create a Parquet table in Hive. I can create it, but when I run analyze table mytable compute statistics; I get this result:
numfiles=800, numrows=10000000, totalSize=18909876, rawDataSize=40000000
Why is the table made up of 800 files for only ~18 MB? Is there a way to set the number of files? I tried SET parquet.block.size=134217728, but the result is the same.
Upvotes: 3
Views: 2523
Reputation: 14871
The number of reducers determines the number of Parquet files. Check the mapred.reduce.tasks parameter.
E.g. you may have a map-reduce job that produces just 100 rows, but if mapred.reduce.tasks is set to 800 (explicitly or implicitly), you'll get 800 Parquet files as output, most of which will contain only headers and no actual data.
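As a rough sketch of a fix, assuming you repopulate the table with an INSERT (source_table here is a hypothetical staging table, not something from the question), you can either cap the reducer count or let Hive merge small output files:

    -- Option 1: cap the number of reducers before the insert,
    -- so at most that many Parquet files are written.
    SET mapred.reduce.tasks=8;

    -- Option 2: let Hive merge small files at the end of the job
    -- (avgsize is in bytes; 134217728 = 128 MB).
    SET hive.merge.mapfiles=true;
    SET hive.merge.mapredfiles=true;
    SET hive.merge.smallfiles.avgsize=134217728;

    -- Rewrite the table so the new file layout takes effect.
    INSERT OVERWRITE TABLE mytable SELECT * FROM source_table;

Either way, the table has to be rewritten; changing the settings alone does not repack existing files.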
Upvotes: 2
Reputation: 23
You also need to set dfs.blocksize=134217728 along with SET parquet.block.size=134217728. Both block sizes should be set while doing the Hive insert.
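A minimal sketch of that session, again assuming the table is reloaded with an INSERT from a hypothetical source_table:

    -- Match the HDFS block size to the Parquet block (row group) size,
    -- here 134217728 bytes = 128 MB, before writing the data.
    SET dfs.blocksize=134217728;
    SET parquet.block.size=134217728;
    INSERT OVERWRITE TABLE mytable SELECT * FROM source_table;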
Upvotes: 0