Venkatesan Muniappan

Reputation: 447

Better process or tools to deploy large number of Hive tables

We maintain a data platform for a large financial institution. We have many Dev, SIT, and UAT servers, and each of them hosts thousands of Hive tables. The challenge is keeping the Hive table schemas identical across all these servers. We do this schema sync-up every 2 to 3 months.

Currently, we keep the table definitions for each database in a single file of DROP and CREATE statements and execute the file with the beeline command, as sketched below. Each DROP and CREATE statement takes at least 5 seconds, so for thousands of tables a deployment takes hours to finish.
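To illustrate, a simplified sketch of one of these files (the database, table, and connection details are placeholders, not our actual setup):

-- tables_mydb.hql (one of many such files, each with thousands of statements)
DROP TABLE IF EXISTS mydb.customer;
CREATE EXTERNAL TABLE mydb.customer (
  id   BIGINT,
  name STRING
)
STORED AS PARQUET
LOCATION '/data/mydb/customer';

which we run against each environment with:

beeline -u jdbc:hive2://hiveserver:10000 -f tables_mydb.hql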

We are working on finding better tools or techniques to speed up the deployment. So far we have done a comparative study of the performance of different tools such as spark-shell, impala, and hcat. The study showed that beeline performs at least 20-30% better than the other tools for CREATE/DROP statements.

If we don't find a better tool, our next step is to stop deploying all the tables every time and instead build a mechanism that keeps a state or version of each table in a table (preferably in an RDBMS). We would use this state table to identify which tables need a sync-up and deploy only those, so that we skip tables that haven't changed since the last schema sync-up. A sketch of this idea follows.
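One possible shape for that state table, kept in the RDBMS (the names and the checksum idea are illustrative, nothing is decided yet):

CREATE TABLE hive_table_state (
  db_name      VARCHAR(128) NOT NULL,
  table_name   VARCHAR(128) NOT NULL,
  ddl_checksum CHAR(32)     NOT NULL,  -- e.g. MD5 of the table's CREATE statement
  deployed_at  TIMESTAMP    NOT NULL,
  PRIMARY KEY (db_name, table_name)
);

The deploy job would compute the checksum of each table's current DDL, compare it against ddl_checksum, generate DROP/CREATE statements only for the tables whose checksums differ, and update the row after a successful deployment.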

I'm posting this question here to see if there are any better ways to deploy a large number of Hive tables.

Upvotes: 1

Views: 124

Answers (1)

leftjoin

Reputation: 38325

CREATE EXTERNAL TABLE and MSCK REPAIR / RECOVER PARTITIONS can be extremely slow because Hive gathers basic statistics by default during these operations.

Try to disable statistics auto gathering before creating tables:

set hive.stats.autogather=false;
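The setting can also be passed per connection through beeline, without editing the DDL files themselves (the JDBC URL and file name below are placeholders):

beeline -u jdbc:hive2://hiveserver:10000 --hiveconf hive.stats.autogather=false -f your_ddl_file.hql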

Upvotes: 1
