user8205906

Hive: Does Hive support partitioning and bucketing while using external tables?

When the PARTITIONED BY or CLUSTERED BY keywords are used while creating Hive tables, Hive creates a separate directory for each partition (and separate files for each bucket). But is this still valid for external tables? My understanding is that the data files backing an external table are not managed by Hive. So does Hive create additional files or directories corresponding to each partition or bucket and move the corresponding data into them?

Edit - adding details.
A few extracts from "Hadoop: The Definitive Guide", Chapter 17: Hive:
CREATE TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);

When we load data into a partitioned table, the partition values are specified explicitly:

LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');

At the filesystem level, partitions are simply nested subdirectories of the table directory. After loading a few more files into the logs table, the directory structure might look like this:

(directory listing of the logs table under the Hive warehouse, with nested dt=.../country=... subdirectories)

The table above is a managed table, so Hive owns the data and created a directory for each partition, as in the tree structure above.
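The partition layout can also be confirmed from Hive itself; a minimal sketch, assuming the logs table above and the default warehouse location from the book:

```sql
-- Each LOAD ... PARTITION creates (or reuses) a nested subdirectory such as
-- /user/hive/warehouse/logs/dt=2001-01-01/country=GB/ and registers the
-- partition in the metastore.
SHOW PARTITIONS logs;
-- lists entries of the form: dt=2001-01-01/country=GB
```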

In the case of an external table:
CREATE EXTERNAL TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);
followed by the same load operation:
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');

How will Hive handle these partitions? For an external table without partitions, Hive simply points at the data files and serves query results by reading them. But when data is loaded into a partitioned external table, where are the partitions created?

Hopefully in the Hive warehouse? Can someone confirm or clarify this?

Upvotes: 0

Views: 3096

Answers (2)

Tharun Kumar

Reputation: 391

There is an easy way to do this. Create your external Hive table first.

CREATE EXTERNAL TABLE database.table (
    id INT,
    name STRING
)
PARTITIONED BY (country STRING)
LOCATION 'xxxx';

Next, run an MSCK (metastore consistency check) command:

 MSCK REPAIR TABLE database.table;

This command recovers all partitions that exist under your table's location and updates the metastore. Now, when you run a query against the table, data from all partitions is retrieved.
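As a sketch of the full flow (partition values here are hypothetical), assuming the data was dropped under the table location by some external process:

```sql
-- Data landed outside Hive, e.g. under xxxx/country=GB/ and xxxx/country=US/.
-- MSCK scans the table location for key=value directories and registers
-- any partitions missing from the metastore; no data is moved.
MSCK REPAIR TABLE database.table;
SHOW PARTITIONS database.table;   -- now lists country=GB, country=US
SELECT count(*) FROM database.table WHERE country = 'GB';
```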

Upvotes: 0

invoketheshell

Reputation: 3897

Suppose you partition on date, as this is a common thing to do.

CREATE EXTERNAL TABLE mydatabase.mytable (
    var1   DOUBLE
  , var2   INT
)
PARTITIONED BY (date STRING)
LOCATION '/user/location/wanted/';

Note that the partition column must not be repeated in the regular column list; Hive rejects a table whose partition column also appears as a data column. (Also, date is a reserved word in newer Hive versions, so a name like dt is safer.)

Then add all your partitions:

ALTER TABLE mytable ADD PARTITION( date = '2017-07-27' );
ALTER TABLE mytable ADD PARTITION( date = '2017-07-28' );

So on and so forth.

Finally, place your data files in the corresponding partition directories. You will have an external partitioned table.
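If the data for a partition already lives somewhere other than the default subdirectory under /user/location/wanted/, the partition can point at that path directly; a sketch, with a hypothetical path:

```sql
-- For external tables, ADD PARTITION ... LOCATION registers existing data
-- in place; Hive does not move or copy the files.
ALTER TABLE mytable ADD PARTITION( date = '2017-07-29' )
LOCATION '/data/already/landed/2017-07-29/';
```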

Upvotes: 0
