Reputation: 123

to populate bucketed tables in hive

I have created a hive table with gender as bucket column.

create table userinfoBucketed(userid INT,age INT,gender STRING,occupation STRING,zipcode STRING) CLUSTERED BY(gender) INTO 2 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE;

loading following data from text file into table (user id | age | gender | occupation | zip code) :

1|24|M|technician|85711

2|53|F|other|94043

3|23|M|writer|32067

4|24|M|technician|43537

5|33|F|other|15213

6|42|M|executive|98101

7|57|M|administrator|91344

8|36|M|administrator|05201

9|29|M|student|01002

10|53|M|lawyer|90703

I have set the hive.enforce.bucketing property to true; set hive.enforce.bucketing=true;

1, when inserted data into table using load command, buckets are not created. all the data stored in one bucket

load data local inpath '/home/mainnode/u.user' into table userinfobucketed;

Question1, Why the data is not split into 2 buckets?

2, when inserted data into table from other table, data stored in 2 buckets. here is the command I have executed:

insert into table userinfobucketed select * from userinfo where gender='M';

Now bucket1(000000_0) has below data: 1|24|M|technician|85711 4|24|M|technician|43537 6|42|M|executive|98101 7|57|M|administrator|91344

bucket2(000001_0) has below data: 3|23|M|writer|32067

Question2, I do not understand why data got stored into 2 buckets even though all same records has the same gender.

Then I again inserted data into the table using the below command. insert into table userinfobucketed select * from userinfo where gender='F';

Now 2 more extra buckets (000000_0_copy_1,000001_0_copy_1) are created and data stored into those instead of inserting data into existing buckets. Now that makes total buckets to 4 even though create table is configured into 2 buckets.

Question3 ; Why the extra buckets got created into of copying into existing buckets

please clarify

Thanks Sean

Upvotes: 1

Answers (2)

ktmq

Reputation: 43

Q1: Why doesn't this work to insert into a bucketed table?

load data local inpath '/home/mainnode/u.user' into table userinfobucketed;

A1: Take a look at this tutorial for inserting into bucketed tables. Hive does not support loading to bucketed tables directly from a flat file using LOAD DATA INPATH, so you have to LOAD the data into a regular table first then INSERT OVERWRITE into your bucketed table.

Q2: Why was the inserted data split into 2 buckets even though all records had the same value for the bucket column?

A2: Hmm. This is abnormal behavior. You should never see records with the same bucket column value getting hashed into different buckets. I suspect you did not drop the table and recreate it after trying the LOAD DATA INPATH method above in Q1. If that were the case, new buckets would be created on the insert, disregarding what's in the existing buckets, which leads us to the next question...

Q3: Why were extra buckets created instead of inserting into existing buckets?

A3: Hive does not append new data to files on insert. Even though you told Hive that your table is bucketed, it only hashes the data you are currently inserting; it does not take into account the data already in the table.

To maintain the number of buckets set in the table definition, you will have to hash all the data together every time you do an insert, and use INSERT OVERWRITE instead of INSERT INTO to overwrite the table.

Generally this is much easier to do if your table is partitioned, so you're not copying and re-hashing your whole table every time you have to do an insert. And speaking of partitioning, since it is such low cardinality, gender is much better suited as a partition value than a bucket value. This article does a pretty good job at explaining this concept.

Upvotes: 4

WestCoastProjects

Reputation: 63062

Bucketing is driven by the hash of the column. Apparently M and F are resulting in the same hash. You might consider making the gender part of the partitioning key - to ensure they end up in different physical files.

Upvotes: 1

to populate bucketed tables in hive

Answers (2)

Related Questions