Reputation: 2085
I understand the concepts of partitioning and bucketing in Hive tables. But what I'd like to know is "when do we for partition and when do we go for bucketing ?" What are ideal scenarios that can be said as suitable for partitioning and bucketing ?
Upvotes: 4
Views: 8689
Reputation: 41
PARTITIONING
will be used when there are few unique values in the Column - which you want to load with your required WHERE clause
BUCKETING
will be used if there are multiple unique values in your Where clause
Upvotes: 0
Reputation: 409
The main reasons in which one uses partition and bucketing.
Partitioning of table data is done for distributing load horizontally .
Example: If we have a very large table names as "Parts" and often we run "where" queries that restricts the results to a particular Part Type.
For a faster query response the table can be partitioned by (PART_TYPE STRING).Once you partition the table it changes the way Hive structures the data storage and Hive now will create sub-directories which will reflect the structure of the partition like:
.../Parts/PART_TYPE = Engine-Part
.../Parts/Part_Type = Brakes
So,now if you run a query on table "Parts" with WHERE PART_TYPE = 'Engine-Part' , it will only scan the contents of one directory PART_TYPE = 'Engine-Part'
Partitioning feature is useful in Hive. but at the same time it may take long time to execute other queries.
Another drawback is if we create too many partitions which in turn creates large number of Hadoop files and directories that got created unnecessarily and it becomes an overhead to NameNode since NameNode must keep all metdatafiles for the file system in memory.
Bucketing is another technique which can be used to further divide the data into more manageable form.
Example: Suppose the table "part_sale" has a top level partition of "sale_date" and it is further partitioned into "part_type" as second level partition.
This will lead to too many small partitions .
.../part_sale/sale-date = 2017-04-18/part_type = engine_part1
.../part_sale/sale-date = 2017-04-18/part_type = engine_part2
.../part_sale/sale-date = 2017-04-18/part_type = engine_part3
.../part_sale/sale-date = 2017-04-18/part_type = engine_part4
If we bucket the "part_sale" table ,and use "part_type" as our bucketing column of the table.The value of this column will be hashed by a user-defined number into buckets.Records with the same "part_type" will always be stored in same bucket.You can specify the number of buckets at the time of table creation so that number of buckets are fixed and there is no fluctuation with data.
Upvotes: 4
Reputation: 1058
Partitioning helps in elimination of data, if used in WHERE clause, where as bucketing helps in organizing data in each partition into multiple files, so as same set of data is always written in same bucket. Helps a lot in joining of columns. Hive Buckets is nothing but another technique of decomposing data or decreasing the data into more manageable parts or equal parts.For example we have table with columns like date,employee_name,employee_id,salary,leaves etc . In this table just use date column as the top-level partition and the employee_id as the second-level partition leads to too many small partitions. We can use HASH value for bucketing or a range to bucket the data.
Hive partitioning and Bucketing is ,when we do partitioning, we create a partition for each unique value of the column. But there may be situation where we need to create lot of tiny partitions. But if you use bucketing, you can limit it to a number which you choose and decompose your data into those buckets. In hive a partition is a directory but a bucket is a file.
In hive, bucketing does not work by default. You will have to set following variable to enable bucketing. set hive.enforce.bucketing=true;
Upvotes: 0
Reputation: 850
Partitioning in Hive :- If we are dealing with a large table and often run queries with WHERE clauses that restrict the results to a particular partitioned column/columns, then we should leverage the partition concept of hive . For a faster query response Hive table can be PARTITIONED BY (partition_cols_name).Its helps to organize the data in logical fashion and when we query the partitioned table using partition column, it allows hive to skip all but relevant sub-directories and files, so scan becomes easy if partition is done properly. Should be done when the cardinality (number of possible values a field can have ) is not high. Else if there are too many partitions, then it is an overhead on the namenode.
Bucketing in Hive :- If you want to segregate the data on a field which has high cardinality (number of possible values a field can have ), then we should use bucketing. If we want only a sample of data according to some specific fields and not the entire data , bucketing can be a good option. If some map-side joins are involved, then bucketed tables are a good option.
Upvotes: 0