Getting counts at each level of a hierarchical query

Question

With Reference to one of the question answer before, Getting counts/totals at each level of a hierarchical query using CONNECT BY

DIRS
====
DIR_ID 
PARENT_DIR_ID
DIR_NAME

FILES
=====
FILE_ID
FILE_NAME
DIR_ID
FILE_SIZE


       If DIRS contains:

DIR_ID   PARENT_DIR_ID   DIR_NAME
======   =============   ========
1                        ROOT
2        1               DIR1_1
3        1               DIR1_2
4        2               DIR2_1
5        2               DIR2_2

Out Out is shown below.



 FILE_ID   FILE_NAME   DIR_ID   FILE_SIZE
    =======   =========   ======   =========
    1         test1.txt   5        100
    2         test2.txt   5        200
    3         test5.txt   5         50 
    4         test3.txt   3        300
    5         test4.txt   3        300
    6         test6.txt   4        100

I want a query that returns the path along with the number of files in or below each node in the hierarchy. Basically a rollup of the number of files. So the query result would look something like:

Path                    File_Count    File_Size
=====                   ===========   --------
/ROOT                   6               1050
/ROOT/DIR1_1            4               450 
/ROOT/DIR1_1/DIR2_1     1               100 
/ROOT/DIR1_1/DIR2_2     3               350  
/ROOT/DIR1_2            2               600

  1  select sys_connect_by_path(dp.dir_name, '/') path    
  2         ,(select count(file_id)                       
  3             from dirs dc                              
  4                  ,files f                             
  5            where f.dir_id(+) = dc.dir_id              
  6          connect by prior dc.dir_id = dc.parent_dir_id
  7            start with dc.dir_id = dp.dir_id           
  8          ) count                                      
  9    from dirs dp                                       
 10    connect by prior dp.dir_id = dp.parent_dir_id      
 11*   start with dp.parent_dir_id is null

How can i also get count of File size column next to size column.

This Query is doing fine except sum of the count of file size.

           (select count(file_id)                       
               from dirs dc                              
                    ,files f                             
              where f.dir_id(+) = dc.dir_id              
            connect by prior dc.dir_id = dc.parent_dir_id
              start with dc.dir_id = dp.dir_id           
            ) count, 
            (select sum(file_size)                       
                       from dirs dc                              
                            ,files f                             
                      where f.dir_id(+) = dc.dir_id              
                    connect by prior dc.dir_id = dc.parent_dir_id
                      start with dc.dir_id = dp.dir_id           
            ) size   
      from dirs dp                                       
     connect by prior dp.dir_id = dp.parent_dir_id      
start with dp.parent_dir_id is null

I modified the query to get size of files of each level but no success. Any idea what is missing?

All i need is the count of the File_Size column as shown above.

Timekiller · Accepted Answer

I thought up multiple solutions, might as well list them all here.

Subquery approach

This is what you use - list every directory, do two recursive subqueries for each directory. I basically repeated what you wrote here, and it works as expected. Your error was to use reserved words count and size for column names.

select sys_connect_by_path(dp.dir_name, '/') path,
(select count(file_id)
   from dirs dc, files f
  where f.dir_id(+) = dc.dir_id
connect by prior dc.dir_id = dc.parent_dir_id
  start with dc.dir_id = dp.dir_id
  ) cnt,
(select sum(file_size)
   from dirs dc, files f
  where f.dir_id(+) = dc.dir_id
connect by prior dc.dir_id = dc.parent_dir_id
  start with dc.dir_id = dp.dir_id
  ) sz
from dirs dp
connect by prior dp.dir_id = dp.parent_dir_id
start with dp.parent_dir_id is null;

However, on larger datasets this can be quite slow. There are other ways.

Straightforward join approach

This one is not very efficient and has some restraints, but seems fairly easy to understand. It executed a bit faster than subquery for me.

1) Get full directory hierarchy

2) Join files against (1) to find out full dirname for each file

3) Join (1) and (2) on file_dirname like dirname||'%' to get a list of files which belong to specified directory or any of its subdirectories, then group by (1).dir_id and get your sums.

Note that dirnames must all end with '/' for this to work properly; also, directory full path must be unique.

It will look something like that:

with dirs_r as
(select sys_connect_by_path(dp.dir_name, '/')||'/' path, dir_id
  from dirs dp                                       
connect by prior dp.dir_id = dp.parent_dir_id      
  start with dp.parent_dir_id is null
)

select dirs_r.path, count(*), sum(files_q.file_size)
  from dirs_r,
  (select d.path, f.file_size
    from files f, dirs_r d
   where d.dir_id = f.dir_id) files_q
 where files_q.path like dirs_r.path||'%'
group by dirs_r.path
order by dirs_r.path

Reverse approach:

This might be a bit harder to understand, but is much more efficient according to explain plan, and it also only uses IDs, so it's more reliable. It also executed much faster than both of the above.

1) Build a reverse directory hierarchy for every directory using dp.dir_id = prior dp.parent_dir_id and connect_by_root(dir_id). This way, you get rows for each directory id and all its parent directories.

2) Join (1) with files, get rows for each parent directory of each file.

3) Join (2) with direct hierarchy on dir_id and then group.

select dirs_r.path, count(*), sum(filedir.file_size) from
(select sys_connect_by_path(dp.dir_name, '/') path, dir_id
  from dirs dp                                       
connect by prior dp.dir_id = dp.parent_dir_id      
  start with dp.parent_dir_id is null
) dirs_r,

(select f.file_size, dirs_reverse.dir_id dir_id
  from files f,
       (select connect_by_root(dp.dir_id) inner_dir_id, dp.dir_id
          from dirs dp
       connect by dp.dir_id = prior dp.parent_dir_id  
       ) dirs_reverse
 where dirs_reverse.inner_dir_id = f.dir_id) filedir

where dirs_r.dir_id = filedir.dir_id
group by dirs_r.path
order by dirs_r.path

Getting counts at each level of a hierarchical query

Answers (1)

Related Questions