Reputation: 944
I have an Oracle table Tab_O and its equivalent table in Hive Tab_H. Both tables are partitioned based on column Part_Col.
Tab_O/ Tab_H
+---+---+---+----------+-----+
| A | B | C | Part_Col | Bal |
+---+---+---+----------+-----+
| 2 | 1 | 9 | 2005 | 100 |
+---+---+---+----------+-----+
| 3 | 1 | 9 | 2005 | 400 |
+---+---+---+----------+-----+
| 3 | 2 | 8 | 2005 | 300 |
+---+---+---+----------+-----+
| 4 | 1 | 9 | 2005 | 200 |
+---+---+---+----------+-----+
The Oracle table has subpartitions on columns B and C, eg. B1_C9
, which can be checked from ALL_TABS_SUBPARTITIONS
. I perform different operations based on whether or not the subpartition exists.
But I cannot check for subpartitions in Hive. So would just checking if the data exists be a sufficient workaround?
SELECT COUNT(*) FROM TAB_H WHERE B=1 AND C=9;
Or could this cause some other problem?
Upvotes: 0
Views: 1280
Reputation: 38290
If statistics is not available or may be not available or may be stale for (sub)partitions then better and safer to use LIMIT 1
for checking data rather than count because in such case count(*)
may cause full (sub)partition scan. In case you are using statistics for count(*)
(set hive.compute.query.using.stats=true;
) and statistics is stale then you will receive wrong result.
This should run without starting map-reduce, as a fetch-only task:
SELECT 1 FROM TAB_H WHERE B=1 AND C=9 limit 1;
You can wrap above command in a shell script, analyze result.
Also you can check folders exist.
Hive partitions are hierarchical folder structure. You can use shell script with hadoop fs
for checking folder exists and pass the result to Hive scripts if necessary using hiveconf
or hivevar
variables or conditionally execute your Hive scripts from the shell. Shell command example:
if $(hadoop fs -test -d $your_sub_partition_dir) ; then echo "exists";else echo "not exists"; fi
Note that partition folder may be empty, partition exists but does not contain any data. This is quite possible if data files were deleted without dropping partition. So the first approach may be preferable.
Upvotes: 1