Md Firdaus Alam

Reputation: 77

Unable to create buckets in Hive in cloudera

I am trying to create a bucketed table in Hive in Cloudera. But, a normal table is created without any buckets.

First, I created a normal table named marks_temp using the Hive CLI:

CREATE  TABLE marks_temp(
id INT,
Name string,
mark int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

I loaded the following data into the marks_temp table from a text file, 'Desktop/Data/littlebigdata.txt':

101,Firdaus,88
102,Pranav,78
103,Rahul,65
104,Sanjoy,65
105,Firdaus,88
106,Pranav,78
107,Rahul,65
108,Sanjoy,65
109,Amar,54
110,Sahil,34
111,Rahul,45
112,Rajnish,67
113,Ranjeet,56
114,Sanjoy,34 

I loaded the above data using the command below:

LOAD DATA LOCAL INPATH 'Desktop/Data/littlebigdata.txt'
INTO TABLE  marks_temp;

After the data loaded successfully, I created a bucketed table named marks_bucketed:

CREATE TABLE marks_bucketed(
id INT,
Name string,
mark int
)
CLUSTERED BY (id) INTO 4 BUCKETS;

Now, I am inserting data into the marks_bucketed table from the marks_temp table.

INSERT INTO marks_bucketed
SELECT id,Name, mark FROM marks_temp;

After this, some jobs start to run. What I observed in the job logs is that it says "Number of reduce tasks is set to 0 since there's no reduce operator":

hive> insert into marks_bucketed
        > select id,Name,mark from marks_temp;
    Query ID = cloudera_20180601035353_29b25ffe-541e-491e-aea6-b36ede88ed79
    Total jobs = 3
    Launching Job 1 out of 3
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_1527668582032_0004, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1527668582032_0004/
    Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1527668582032_0004
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
    2018-06-01 03:54:01,328 Stage-1 map = 0%,  reduce = 0%
    2018-06-01 03:54:14,444 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.21 sec
    MapReduce Total cumulative CPU time: 2 seconds 210 msec
    Ended Job = job_1527668582032_0004
    Stage-4 is selected by condition resolver.
    Stage-3 is filtered out by condition resolver.
    Stage-5 is filtered out by condition resolver.
    Moving data to: hdfs://quickstart.cloudera:8020/user/hive/warehouse/marks_bucketed/.hive-staging_hive_2018-06-01_03-53-45_726_2788383119636056364-1/-ext-10000
    Loading data to table default.marks_bucketed
    Table default.marks_bucketed stats: [numFiles=1, numRows=14, totalSize=194, rawDataSize=180]
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 1   Cumulative CPU: 2.21 sec   HDFS Read: 3937 HDFS Write: 273 SUCCESS
    Total MapReduce CPU Time Spent: 2 seconds 210 msec
    OK
    Time taken: 31.307 seconds

Even the Hue File Browser is showing only one file. Screenshot is attached: Hue File Browser screenshot for marks_bucketed table.

Upvotes: 2

Views: 295

Answers (1)

hlagos

Reputation: 7947

From the Hive documentation:

Version 0.x and 1.x only

The command set hive.enforce.bucketing = true; allows the correct number of reducers and the cluster by column to be automatically selected based on the table. Otherwise, you would need to set the number of reducers to be the same as the number of buckets as in set mapred.reduce.tasks = 256; and have a CLUSTER BY ... clause in the select.

So you will need to set that property to enforce bucketing, or go with the manual option: set the reducer count to the number of buckets and add a CLUSTER BY clause to the insert, like

set mapred.reduce.tasks = 4;
insert into marks_bucketed
select id, Name, mark from marks_temp cluster by id;
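For the first option, a minimal sketch on Hive 0.x/1.x (table and column names taken from the question):

```sql
-- Hive 0.x/1.x: let Hive derive the reducer count (4) and the
-- CLUSTER BY column (id) from the bucketed table definition.
SET hive.enforce.bucketing = true;

INSERT INTO TABLE marks_bucketed
SELECT id, Name, mark FROM marks_temp;
```

Note that from Hive 2.0 onward hive.enforce.bucketing was removed and bucketing is always enforced, so the plain INSERT would bucket correctly on its own. After either variant, the table's HDFS directory should contain four bucket files (typically named 000000_0 through 000003_0) instead of one.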

Upvotes: 1
