Reputation: 95
I have to create an automated process to list all external tables in Hive and do a record count on those tables.
I should do this as a daily job. I tried this by hard coding all the external table names, but this is not accepted as the tables keep on changing once in a month.
I have gone through different approaches like [show tables]
and executing query in metastore DB. But these will not help me in automating the process.
Is there a better approach to implement this in Hive.
Upvotes: 1
Views: 1346
Reputation: 11
I liked the method of interacting with the Hive Metastore database in MySQL. You can wrap this in a shell script.
select d.NAME,t.TBL_NAME,t.TBL_TYPE,s.LOCATION from TBLS t JOIN DBS d ON t.DB_ID=d.DB_ID JOIN SDS s ON t.SD_ID=s.SD_ID where t.TBL_TYPE='EXTERNAL_TABLE';
Upvotes: 0
Reputation: 11
As per my understanding first your requirement is to find out list of Hive external tables in all hive databases .
Database name,Table name,Table Type(External) and HDFS Location of hive External tables.
Login to hive metastore and use Hive meta database.
Use 3 tables TBLS,DBS and SDS tables , on top of this 3 tables we can apply joins on DB_ID and SD_ID
By using above format you can get databases names along with respected hive external tables list and HDFS path location also .
For query and output information please read this link
Upvotes: 0
Reputation: 38335
Something like this, using shell.
#Create external table list for a schema
SCHEMA=your_schema_name
#define filenames
alltableslist=tables_$SCHEMA
exttablelist=ext_tables_$SCHEMA
#Get all tables
hive -S -e " set hive.cli.print.header=false; use $SCHEMA; show tables;" 1> $alltableslist
#For each table check its type:
for table in $(cat $alltableslist)
do
echo Processing table $table ...
#Describe table
describe=$(hive client -S -e "use $SCHEMA; DESCRIBE FORMATTED $table")
#Get type
table_type=$(echo "${describe}" | egrep -o 'Table Type:[^,]+' | cut -f2)
#Check table type, get count and write table name with count
if [ $table_type == EXTERNAL_TABLE ]; then
#get count
cnt=$(hive client -S -e "select count(*) from $SCHEMA.table ")
#save result
echo "$table $cnt" > $exttablelist
fi
done; #tables loop
Just replace your_schema_name
at the beginning with your schema name. External tables with counts in this example will be saved in the file ext_tables_[your_schema_name]
Counts can be processed in parallel and even in single SQL statement and many other things can be improved, but hope you have caught the idea.
Upvotes: 1