Reputation: 275
I have a scenario find the lowest level child nodes from hierarchy table having parent_node_id and child_node_id as below. Source table is in Hive and Impala database. Please suggest hive/impala query to find out the lowest level child nodes for each parent node in source table.
I tried using CTE recursive query in Impala but I guess it is not supported.
Thank you in Advance!
Source Table:
+-------------+--------------+
|child_node_id|parent_node_id|
+-------------+--------------+
| C1 | P1 |
+-------------+--------------+
| C2 | P2 |
+-------------+--------------+
| C11 | C1 |
+-------------+--------------+
| C12 | C11 |
+-------------+--------------+
| 123 | C12 |
+-------------+--------------+
Expected Output:
+-------------+--------------+
|parent_node |lowest_l_child|
+-------------+--------------+
| P1 | 123 |
+-------------+--------------+
| P2 | C2 |
+-------------+--------------+
| C1 | 123 |
+-------------+--------------+
| C11 | 123 |
+-------------+--------------+
| C12 | 123 |
+-------------+--------------+
Upvotes: 0
Views: 2660
Reputation: 1197
Since hive does not support recursive CTE query.
Please refer to [https://blog.pythian.com/recursion-in-hive/][1] for one option.
Other option would be to use shell script to loop and query to find lowest child for all the parents.
Steps - 1> Intialization (One time run)
create temporary table hier_temp as select * from Hier;
create table if not exists res as select * from hier where false;
2) query to find lowest level child
insert into table res
select
H1.parent, H1.Child
from hier_temp H1 left outer join hier_temp H2
on H1.Child=H2.Parent where H2.Child is null;
3) Overwrite temp table with next level of child
insert overwrite table hier_temp
select
H1.Parent Parent, coalesce(H3.child, H2.child) as child
from hier_temp H1 left outer join hier_temp H2 on H1.Child=H2.Parent
left outer join res H3 on H2.child=H3.parent
where H2.Child is not null;
Create a shell script which will execute steps 2 and 3 in sequence in loop (Conditional statement with break and continue will do the job) till we don't have any data in hier_temp table.
Below are results from res and hier_temp table for the given test data.
hive> select * from res;
OK
Time taken: 0.131 seconds
hive> select * from hier_temp;
OK
C1 C11
C11 C12
C12 123
P1 C1
P2 C2
Time taken: 0.108 seconds, Fetched: 5 row(s)
Results after loop1 for queries mentioned in step 2 and 3
hive> select * from res;
OK
C12 123
P2 C2
Time taken: 0.137 seconds, Fetched: 2 row(s)
hive> select * from hier_temp;
OK
P1 C11
C1 123
C11 123
Time taken: 0.155 seconds, Fetched: 3 row(s)
Results after loop2 for queries mentioned in step 2 and 3
hive> select * from res;
OK
C12 123
P2 C2
C1 123
C11 123
Time taken: 0.11 seconds, Fetched: 4 row(s)
hive> select * from hier_temp;
OK
P1 123
Time taken: 0.111 seconds, Fetched: 1 row(s)
Results loop3 for queries mentioned in step 2 and 3
hive> select * from res;
OK
P1 123
C12 123
P2 C2
C1 123
C11 123
Time taken: 0.115 seconds, Fetched: 5 row(s)
hive> select * from hier_temp;
OK
Time taken: 0.16 seconds
This will get you the desired results, but you might have to consider time taken for execution.
Hope this helps
Upvotes: 2