Reputation: 77
I am trying to create a bucketed table in Hive on Cloudera, but a normal table is created instead, without any buckets.
First, I created a normal table named marks_temp using the Hive CLI:
CREATE TABLE marks_temp(
id INT,
Name string,
mark int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
I loaded the following data into the marks_temp table from a text file, 'Desktop/Data/littlebigdata.txt':
101,Firdaus,88
102,Pranav,78
103,Rahul,65
104,Sanjoy,65
105,Firdaus,88
106,Pranav,78
107,Rahul,65
108,Sanjoy,65
109,Amar,54
110,Sahil,34
111,Rahul,45
112,Rajnish,67
113,Ranjeet,56
114,Sanjoy,34
I loaded the above data using the command below:
LOAD DATA LOCAL INPATH 'Desktop/Data/littlebigdata.txt'
INTO TABLE marks_temp;
After loading the data successfully, I created a bucketed table named marks_bucketed:
CREATE TABLE marks_bucketed(
id INT,
Name string,
mark int
)
CLUSTERED BY (id) INTO 4 BUCKETS;
Now I am inserting data into the marks_bucketed table from the marks_temp table:
INSERT INTO marks_bucketed
SELECT id,Name, mark FROM marks_temp;
After this, some jobs start to run. What I have observed in the job logs is that it says "Number of reduce tasks is set to 0 since there's no reduce operator":
hive> insert into marks_bucketed
> select id,Name,mark from marks_temp;
Query ID = cloudera_20180601035353_29b25ffe-541e-491e-aea6-b36ede88ed79
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1527668582032_0004, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1527668582032_0004/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1527668582032_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-06-01 03:54:01,328 Stage-1 map = 0%, reduce = 0%
2018-06-01 03:54:14,444 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.21 sec
MapReduce Total cumulative CPU time: 2 seconds 210 msec
Ended Job = job_1527668582032_0004
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://quickstart.cloudera:8020/user/hive/warehouse/marks_bucketed/.hive-staging_hive_2018-06-01_03-53-45_726_2788383119636056364-1/-ext-10000
Loading data to table default.marks_bucketed
Table default.marks_bucketed stats: [numFiles=1, numRows=14, totalSize=194, rawDataSize=180]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 2.21 sec HDFS Read: 3937 HDFS Write: 273 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 210 msec
OK
Time taken: 31.307 seconds
Even the Hue File Browser is showing only one file. Screenshot is attached: Hue File Browser screenshot for the marks_bucketed table.
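The same thing can be checked from the Hive CLI with the built-in dfs command (the table directory below is the default warehouse path shown in the job log above):

```sql
-- List the files under the table directory.
-- A table bucketed INTO 4 BUCKETS should show 4 bucket files here,
-- but only a single file appears.
dfs -ls /user/hive/warehouse/marks_bucketed;
```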
Upvotes: 2
Views: 295
Reputation: 7947
From the Hive documentation:
Version 0.x and 1.x only
The command set hive.enforce.bucketing = true; allows the correct number of reducers and the cluster by column to be automatically selected based on the table. Otherwise, you would need to set the number of reducers to be the same as the number of buckets as in set mapred.reduce.tasks = 256; and have a CLUSTER BY ... clause in the select.
So you will need to set that property to enforce the bucketing, or go for the manual option and run your query like:
set mapred.reduce.tasks = 4;
insert into marks_bucketed
select id, Name, mark from marks_temp cluster by id;
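With the first option, the whole flow would look something like this (a sketch for Hive 0.x/1.x, per the quoted docs; from Hive 2.0 on, bucketing is enforced by default and the property no longer exists):

```sql
-- Let Hive derive the reducer count (4, matching the bucket count)
-- and the clustering column from the table definition.
set hive.enforce.bucketing = true;

insert into marks_bucketed
select id, Name, mark from marks_temp;
```

After this, the job log should report 4 reducers instead of 0, and the table directory should contain 4 bucket files.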
Upvotes: 1