Reputation: 375
I have a hive table that stores a drive's path to the leaf node, and how many files it has. And want to build a new hive table that can look up how many files exists under any folder depth. Can I get some tips on how to achieve this?
My Input table is:
>select * from default.file_file
path file_count
/teamA/projectA/ 3
/teamA/projectB/ 2
I'm trying to build the output table so that it can look like:
>select * from default.output_table
path file_count
/ 5
/teamA/ 5
/teamA/projectA/ 3
/teamA/projectB/ 2
So Far I tried the lateral_view + explode method, but that doesn't allow me to keep track of the directory hierarchy (each /
is stored in separate rows).
Upvotes: 1
Views: 301
Reputation: 38325
You can split path, then use running collect_set analytic function + concat_ws to build path again. After this aggregate file_count group by path:
Demo:
SELECT stack (2,
'/teamA/projectA/', 3,
'/teamA/projectB/', 2
) AS (path, file_count)
)
select path, sum(file_count) file_count
from
(
select t.file_count,
concat(concat_ws('/',collect_set(s.node) over(partition by t.path order by level))
, '/'
) path
from Table1 t
--remove trailing /, split and explode
lateral view posexplode(split(regexp_replace(t.path,'/$',''),'/')) s as level, node
)s
group by path;
Result:
path file_count
/ 5
/teamA/ 5
/teamA/projectA/ 3
/teamA/projectB/ 2
Upvotes: 1