Reputation: 123
I have created a hive table with gender as bucket column.
create table userinfoBucketed(userid INT,age INT,gender STRING,occupation STRING,zipcode STRING) CLUSTERED BY(gender) INTO 2 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE;
loading following data from text file into table (user id | age | gender | occupation | zip code) :
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
6|42|M|executive|98101
7|57|M|administrator|91344
8|36|M|administrator|05201
9|29|M|student|01002
10|53|M|lawyer|90703
I have set the hive.enforce.bucketing property to true; set hive.enforce.bucketing=true;
1, when inserted data into table using load command, buckets are not created. all the data stored in one bucket
load data local inpath '/home/mainnode/u.user' into table userinfobucketed;
Question1, Why the data is not split into 2 buckets?
2, when inserted data into table from other table, data stored in 2 buckets. here is the command I have executed:
insert into table userinfobucketed select * from userinfo where gender='M';
Now bucket1(000000_0) has below data: 1|24|M|technician|85711 4|24|M|technician|43537 6|42|M|executive|98101 7|57|M|administrator|91344
bucket2(000001_0) has below data: 3|23|M|writer|32067
Question2, I do not understand why data got stored into 2 buckets even though all same records has the same gender.
Then I again inserted data into the table using the below command. insert into table userinfobucketed select * from userinfo where gender='F';
Now 2 more extra buckets (000000_0_copy_1,000001_0_copy_1) are created and data stored into those instead of inserting data into existing buckets. Now that makes total buckets to 4 even though create table is configured into 2 buckets.
Question3 ; Why the extra buckets got created into of copying into existing buckets
please clarify
Thanks Sean
Upvotes: 1
Views: 3383
Reputation: 43
Q1: Why doesn't this work to insert into a bucketed table?
load data local inpath '/home/mainnode/u.user' into table userinfobucketed;
A1: Take a look at this tutorial for inserting into bucketed tables. Hive does not support loading to bucketed tables directly from a flat file using LOAD DATA INPATH
, so you have to LOAD
the data into a regular table first then INSERT OVERWRITE
into your bucketed table.
Q2: Why was the inserted data split into 2 buckets even though all records had the same value for the bucket column?
A2: Hmm. This is abnormal behavior. You should never see records with the same bucket column value getting hashed into different buckets. I suspect you did not drop the table and recreate it after trying the LOAD DATA INPATH
method above in Q1. If that were the case, new buckets would be created on the insert, disregarding what's in the existing buckets, which leads us to the next question...
Q3: Why were extra buckets created instead of inserting into existing buckets?
A3: Hive does not append new data to files on insert. Even though you told Hive that your table is bucketed, it only hashes the data you are currently inserting; it does not take into account the data already in the table.
To maintain the number of buckets set in the table definition, you will have to hash all the data together every time you do an insert, and use INSERT OVERWRITE
instead of INSERT INTO
to overwrite the table.
Generally this is much easier to do if your table is partitioned, so you're not copying and re-hashing your whole table every time you have to do an insert. And speaking of partitioning, since it is such low cardinality, gender is much better suited as a partition value than a bucket value. This article does a pretty good job at explaining this concept.
Upvotes: 4
Reputation: 63062
Bucketing is driven by the hash of the column. Apparently M and F are resulting in the same hash. You might consider making the gender part of the partitioning key - to ensure they end up in different physical files.
Upvotes: 1