Reputation: 5790
Background: I can calculate the size of my Hive partitioned table as below (i.e. find the size of every partition, then sum the sizes of the partitions I need):
hdfs dfs -du 'hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1'
100 hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1/dt=2021-10-06
200 hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1/dt=2021-10-07
300 hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1/dt=2021-10-08
400 hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1/dt=2021-10-09
500 hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1/dt=2021-10-10
600 hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1/dt=2021-10-11
I want to find the size of the table across a specific range of partitions (e.g. dt >= 2021-10-09).
That is, the size of test.tbl1 over the partitions dt=2021-10-09 through dt=2021-10-11 is 400 + 500 + 600 = 1500 bytes.
Is there any way to filter the result during the dfs -du step so I can avoid summing manually?
Upvotes: 0
Views: 502
Reputation: 29230
With a POSIX awk (e.g. GNU or BSD awk) you can pre-process the last field ($NF) of each line with the gsub function to keep only the date. You can then add the first field ($1) to a running sum if and only if the last field ($NF) falls in the defined range. Example for dt >= 2021-10-09:
$ hdfs dfs -du 'hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1' |
awk -v dtmin='2021-10-09' '
{gsub(/.*=/,"",$NF)} $NF>=dtmin{sum+=$1} END{print sum}'
1500
For a closed range:
$ hdfs dfs -du 'hdfs://localhost:9090/user/temp/warehouse/test.db/tbl1' |
awk -v dtmin='2021-10-06' -v dtmax='2021-10-08' '
{gsub(/.*=/,"",$NF)} $NF>=dtmin && $NF<=dtmax{sum+=$1} END{print sum}'
600
Upvotes: 1