Reputation: 5790
Background: I can calculate the size of my Hive partitioned table as below (i.e. find the size of every partition, then sum the sizes of the partitions I need):
hdfs dfs -du 'hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1'
100 hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1/dt=2021-10-06
200 hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1/dt=2021-10-07
300 hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1/dt=2021-10-08
400 hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1/dt=2021-10-09
500 hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1/dt=2021-10-10
600 hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1/dt=2021-10-11
I want to find the size of the table across a specific range of partitions (e.g. dt >= 2021-10-09).
That is, the size of test.tbl1 over the partitions dt=2021-10-09 through dt=2021-10-11 is 400 + 500 + 600 = 1500 bytes.
Is there any way to filter the result during the dfs -du step so I can avoid summing manually?
Upvotes: 0
Views: 502
Reputation: 29230
With a POSIX awk (e.g. GNU or BSD awk) you can pre-process the last field ($NF) of each line with the gsub function to keep only the date. You can then add the first field ($1) to a running sum if and only if the last field ($NF) falls in the defined range. Example for dt >= 2021-10-09:
$ hdfs dfs -du 'hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1' |
awk -v dtmin='2021-10-09' '
{gsub(/.*=/,"",$NF)} $NF>=dtmin{sum+=$1} END{print sum}'
1500
For a closed range:
$ hdfs dfs -du 'hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1' |
awk -v dtmin='2021-10-06' -v dtmax='2021-10-08' '
{gsub(/.*=/,"",$NF)} $NF>=dtmin && $NF<=dtmax{sum+=$1} END{print sum}'
600
Upvotes: 1