Reputation: 469
I have a set of Hive tables that are not in ORC format and are not bucketed. I want to convert them to ORC and make them bucketed, but I couldn't find a concrete answer anywhere on the net. Any answer or guidance is appreciated. The Hive version is 2.3.5.
Alternatively, is it possible to do this in Spark (PySpark or Scala)?
The simplest solution would be to create a new table that is bucketed and stored as ORC, then insert into it from the old table. I'm looking for an in-place solution.
Upvotes: 0
Views: 2924
Reputation: 507
Hive:
Use a staging table to read the un-bucketed data (assuming TEXTFILE format) with these commands; consider declaring it EXTERNAL so that dropping it afterwards does not delete the source files:
CREATE TABLE staging_table(
    col1 colType,
    col2 colType,
    ...
    coln colType
)
STORED AS TEXTFILE
LOCATION '/path/of/input/data';
CREATE TABLE target_table(
    col1 colType,
    col2 colType,
    ...
    coln colType
)
CLUSTERED BY(col1) INTO 10 BUCKETS
STORED AS ORC;
INSERT OVERWRITE TABLE target_table
SELECT
    col1, col2, ..., coln
FROM
    staging_table;
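Since the question targets Hive 2.3.5, no extra setting is needed here: Hive 2.x always enforces bucketing on insert. Only on older Hive 1.x clusters would the insert need to be preceded by the (since removed) enforcement flag:
-- Hive 1.x only; removed in Hive 2.0, where bucketing is always enforced
SET hive.enforce.bucketing = true;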
The same can be done with the **Spark** DataFrame API (assuming CSV format) like this:
# assumes an active SparkSession named `spark` (as in the pyspark shell)
df = (spark.read.format("csv")
      .option("inferSchema", "true")
      .option("header", "true")
      .option("delimiter", ",")
      .option("path", "/path/of/input/data/")
      .load())

(df.write.format("orc")
   .option("path", "/path/of/output/data/")
   .save())
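Note that save() above only converts the files to ORC; it does not bucket anything. To bucket from Spark, bucketBy must be paired with saveAsTable, since path-based save() does not support bucketing. A minimal sketch, where the bucketing column col1 and the table name target_table are assumptions:
(df.write.format("orc")
   .bucketBy(10, "col1")          # 10 buckets on col1, mirroring CLUSTERED BY(col1)
   .sortBy("col1")                # optional: produces sorted buckets
   .saveAsTable("target_table"))  # bucketBy is only supported with saveAsTable
Be aware that Spark's bucket layout differs from Hive's, so a table bucketed by Spark may not behave as a Hive-bucketed table; if Hive compatibility matters, prefer the Hive INSERT approach above.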
Upvotes: 2
Reputation: 38335
Create a bucketed table and load the data into it using INSERT OVERWRITE:
CREATE TABLE table_bucketed(col1 string, col2 string)
CLUSTERED BY(col1) INTO 10 BUCKETS
STORED AS ORC;
INSERT OVERWRITE TABLE table_bucketed
SELECT ...
FROM table_not_bucketed;
See also Sorted Bucketed Table.
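There is no single in-place command for this (changing the file format and the bucketing both require rewriting the data), but after verifying the load you can swap names so that queries keep using the original table name. A sketch, assuming the table names above:
ALTER TABLE table_not_bucketed RENAME TO table_not_bucketed_backup;
ALTER TABLE table_bucketed RENAME TO table_not_bucketed;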
Upvotes: 1