Alan

Reputation: 469

Is it possible to convert a Hive table's format to ORC and make it bucketed?

I have a set of Hive tables that are not in ORC format and are not bucketed. I want to convert them to ORC and also make them bucketed. I couldn't find a concrete answer anywhere on the net; any answer or guidance is appreciated. The Hive version is 2.3.5.

Or is it possible to do this in Spark (PySpark or Scala)?

The simplest solution would be to create a new table that is bucketed and in ORC format and then insert into it from the old table, but I'm looking for an in-place solution.

Upvotes: 0

Views: 2924

Answers (2)

Kumar Rohit

Reputation: 507

Hive: Use an external staging table to read the un-bucketed data (assuming TEXTFILE format) with these commands:

-- EXTERNAL so that dropping the staging table later won't delete the source files
CREATE EXTERNAL TABLE staging_table(
    col1 colType, 
    col2 colType, ...
    coln colType
)
STORED AS 
    TEXTFILE
LOCATION 
    '/path/of/input/data';

CREATE TABLE target_table(
    col1 colType, 
    col2 colType, ...
    coln colType
)
CLUSTERED BY(col1) INTO 10 BUCKETS
STORED AS ORC;

INSERT OVERWRITE TABLE target_table
SELECT 
    col1, col2, ..., coln
FROM 
    staging_table;
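
If the old table's name must survive (the closest thing to the in-place conversion the asker wants), a common follow-up is to swap the two tables with ALTER TABLE ... RENAME TO. A minimal sketch, assuming the original table is named old_table and that target_table has been populated as above:

ALTER TABLE old_table RENAME TO old_table_backup;
ALTER TABLE target_table RENAME TO old_table;
-- only after verifying the data in the renamed table:
-- DROP TABLE old_table_backup;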

The same can be done with the Spark DataFrame API (assuming CSV input) like this:

# wrap the chained calls in parentheses so the continuation lines are valid Python
df = (spark.read.format("csv")
      .option("inferSchema", "true")
      .option("header", "true")
      .option("delimiter", ",")
      .option("path", "/path/of/input/data/")
      .load())

(df.write.format("orc")
 .option("path", "/path/of/output/data/")
 .save())
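
Note that save() above converts the files to ORC but does not bucket them. A minimal sketch of a bucketed write, assuming the bucket column is col1 and a Hive-enabled SparkSession; bucketBy() is only supported together with saveAsTable(), and Spark's bucket layout is not identical to Hive's, so Hive may not treat the result as a bucketed table:

# bucketBy requires saveAsTable; plain save() ignores bucketing
(df.write.format("orc")
 .bucketBy(10, "col1")   # col1 is an assumed bucket column
 .sortBy("col1")
 .saveAsTable("target_table"))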

Upvotes: 2

leftjoin

Reputation: 38335

Create bucketed table and load data into it using INSERT OVERWRITE:

CREATE TABLE table_bucketed(col1 string, col2 string)
CLUSTERED BY(col1) INTO 10 BUCKETS
STORED AS ORC;

INSERT OVERWRITE TABLE table_bucketed
SELECT ...
  FROM table_not_bucketed;

See also Sorted Bucketed Table.
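
For reference, the sorted variant only adds a SORTED BY clause to the DDL; a sketch with the same assumed columns:

CREATE TABLE table_bucketed(col1 string, col2 string)
CLUSTERED BY(col1) SORTED BY(col2) INTO 10 BUCKETS
STORED AS ORC;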

Upvotes: 1
