axnet
axnet

Reputation: 5790

How to get min and max partition values from hive table using cli?

I have variety of tables in hive with minimum 0 to maximum 4 partition columns.

Below is HDFS representation of few tables with partitions ranging from 0 to 4.

-- type-0 <no partitions>
hdfs://ns/user/abc/warehouse/test_db/test_tbl_0/__SNAPPY.gz


-- type-1 <1 partition column in table  = dt>
hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-11-14/__SNAPPY.gz
...
hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-11-30/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-12-16/__SNAPPY.gz


-- type-2 <2 partition columns in table = dt, hh>
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-11-14/hh=01/__SNAPPY.gz
...
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-11-15/hh=02/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-12-19/hh=03/__SNAPPY.gz


-- type-3 <3 partition columns in table = client, dt, hh>
hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-11-14/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-11-29/hh=01/__SNAPPY.gz
...
hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-12-20/hh=04/__SNAPPY.gz


-- type-4 <4 partition columns in table = service, geo, dt, hh>
hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-11-14/hh=01/__SNAPPY.gz   
hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-11-20/hh=01/__SNAPPY.gz
...
hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-12-13/hh=21/__SNAPPY.gz

Expected output for type 0 to 4 As requested by markp-fuso

DBName  TableName  MIN_PARTITION(s) MAX_PARTITION(s)
test_db    test_tbl_0

test_db_a  test_tbl_1 dt=2020-11-14    dt=2020-12-16

test_db_b  test_tbl_2 dt=2020-11-14/hh=01 dt=2020-12-19/hh=03/

test_db_c  test_tbl_3 client=cobra/dt=2020-11-14/hh=01 client=cobra/dt=2020-12-20/hh=04

test_db_d  test_tbl_4 service=mobile/geo=us/dt=2020-11-14/hh=01 service=mobile/geo=us/dt=2020-12-13/hh=21

Below is what I have tried for type-2 <table with 2 partitions>.

## Getting Minimum and Maximum partition lines, 
### here i am removing lines of hdfs output like 'Found 20 items'>
hdfs dfs -ls 'hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/*' | grep -v '^Found' | sort -k6,7 | awk '{print $8}' | (head -n1 && tail -n1)
/*
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2019-03-12/hh=00
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2021-07-28/hh=22
*/

## Here i am further trying to simplify the output 
hdfs dfs -ls 'hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/*' | grep -v '^Found' | sort -k6,7 | awk '{print $8}' | (head -n1 && tail -n1) | awk -F'/' '{print $(NF-2),$(NF-1),$NF}' | sed ':a;N;$!ba;s/\n/ /g'
/*
test_tbl_2 dt=2019-03-12 hh=00  test_tbl_2 dt=2021-07-28 hh=22
*/

As we can see above I am getting output in below format.

TableName  MIN_PARTITION(s)  TableName  MAX_PARTITION(s)   

Also above method i have tested only on table with 2 partitions, is there any generic bash hack that gives me in below format and regardless of any number of partitions?

DBName  TableName  MIN_PARTITION(s) MAX_PARTITION(s)  

Upvotes: 1

Views: 885

Answers (1)

markp-fuso
markp-fuso

Reputation: 34698

UPDATE: question updated with more sample inputs as well as the matching (desired) ouputs

Assumptions:

  • inputs for a given db/table pair are on consecutive lines so we can generate output as we exhaust the input for a given db/table pair (otherwise we'd need to store all data in memory - eg, arrays - and then print all output once the entire input stream has been exhausted)
  • output format has 4 columns: DBName TableName MinPartition MaxPartition
  • if a db/table pair only has one line of input then the min and max columns will contain the same value
  • using a / as a field delimiter, the 'last' field (__SNAPPY.gz in the sample inputs) is to be ignored

Sample inputs to be used for demo purposes:

$ cat hdfs.input
# no min/max for test_db/test_tbl_0

hdfs://ns/user/abc/warehouse/test_db/test_tbl_0/__SNAPPY.gz

hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-11-14/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-11-30/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_a/test_tbl_1/dt=2020-12-16/__SNAPPY.gz

hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-11-14/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-11-15/hh=02/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2020-12-19/hh=03/__SNAPPY.gz

hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-11-14/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-11-29/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_c/test_tbl_3/client=cobra/dt=2020-12-20/hh=04/__SNAPPY.gz

hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-11-14/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-11-20/hh=01/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_d/test_tbl_4/service=mobile/geo=us/dt=2020-12-13/hh=21/__SNAPPY.gz

hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2019-03-12/hh=00/__SNAPPY.gz
hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_2/dt=2021-07-28/hh=22/__SNAPPY.gz

# min=max for test_db_b/test_tbl_7

hdfs://ns/user/abc/warehouse/test_db_b/test_tbl_7/dt=2021-07-28/hh=22/__SNAPPY.gz

One awk idea:

awk -F'/' '
function printline() {
        if ( dbname != "") print dbname, tabname, minpart, maxpart
        minpart = maxpart = ""
}
/^hdfs/ { if ( $7 != dbname || $8 != tabname )
             printline()
          dbname = $7
          tabname = $8
          if ( $10 == "" ) {
             minpart = maxpart = ""
             next
          }
          pfx = ""
          currpart = ""
          for (i=9; i<NF; i++) {
              currpart = currpart pfx $i
              pfx=FS
          }
          minpart = ( (minpart == "") || (currpart < minpart) ) ? currpart : minpart
          maxpart = ( (maxpart == "") || (currpart > maxpart) ) ? currpart : maxpart
        }
END     { printline() }
' hdfs.input

This generates:

test_db test_tbl_0
test_db_a test_tbl_1 dt=2020-11-14 dt=2020-12-16
test_db_b test_tbl_2 dt=2020-11-14/hh=01 dt=2020-12-19/hh=03
test_db_c test_tbl_3 client=cobra/dt=2020-11-14/hh=01 client=cobra/dt=2020-12-20/hh=04
test_db_d test_tbl_4 service=mobile/geo=us/dt=2020-11-14/hh=01 service=mobile/geo=us/dt=2020-12-13/hh=21
test_db_b test_tbl_2 dt=2019-03-12/hh=00 dt=2021-07-28/hh=22
test_db_b test_tbl_7 dt=2021-07-28/hh=22 dt=2021-07-28/hh=22

Upvotes: 1

Related Questions