histelheim

Reputation: 5088

Computing how many folders each folder has in a complex folder structure?

Consider the following tree:

library(data.tree)

acme <- Node$new("Acme Inc.")
    accounting <- acme$AddChild("Accounting")
        software <- accounting$AddChild("New Software")
        standards <- accounting$AddChild("New Accounting Standards")
    research <- acme$AddChild("Research")
        newProductLine <- research$AddChild("New Product Line")
        newLabs <- research$AddChild("New Labs")
    it <- acme$AddChild("IT")
        outsource <- it$AddChild("Outsource")
        agile <- it$AddChild("Go agile")
        goToR <- it$AddChild("Switch to R")

I then want to compute the averageBranchingFactor:

averageBranchingFactor(acme)

This yields 2.5.

However, I want to get all the individual branching factors, not only their average. I need this, for example, to compare two file structures statistically and test for significant differences in their average branching factors.

According to the manual for data.tree the AverageBranchingFactor() function performs the following: "calculate the average number of branches each non-leaf has." Therefore, I first tried the following:

acme.df <- ToDataFrameTree(acme, "averageBranchingFactor")
mean(acme.df$averageBranchingFactor[acme.df$averageBranchingFactor>0])

This yields 2.375, which then led me to try a simpler version:

mean(acme.df$averageBranchingFactor)

This yields 0.8636364

How do I arrive at all the individual branching factors that together have a mean of 2.5?

Ideally I would like to create a data.frame that lists every folder, with a variable giving the branching factor of each folder. For example, I have this very simple folder structure:

top_level_folder
    sub_folder_1
    sub_folder_2
         sub_folder_3

Answering the question would involve creating an output that looks like this:

Folders             Subfolders (BranchingFactor)
top_level_folder    2
sub_folder_1        0
sub_folder_2        1
sub_folder_3        0

The first column can simply be generated by calling list.dirs("/Users/username/Downloads/top_level/"), but I don't know how to generate the second column. Note that the second column is non-recursive, meaning that folders within subfolders are not counted (i.e. top_level_folder contains only 2 subfolders, even though sub_folder_2 contains another folder, sub_folder_3).

If you want to see whether your solution scales or not, download the Rails codebase: https://github.com/rails/rails/archive/master.zip and try it on Rails' more complex file structure.

Upvotes: 0

Views: 174

Answers (5)

Christoph Glur

Reputation: 1244

The averageBranchingFactor excludes leaves. Side note: you can get acme directly using data(acme).

library(data.tree)
data(acme)
acme$averageBranchingFactor
acme$count
print(acme, abf = "averageBranchingFactor", "count")

This prints:

                          levelName abf count
1  Acme Inc.                        2.5     3
2   ¦--Accounting                   2.0     2
3   ¦   ¦--New Software             0.0     0
4   ¦   °--New Accounting Standards 0.0     0
5   ¦--Research                     2.0     2
6   ¦   ¦--New Product Line         0.0     0
7   ¦   °--New Labs                 0.0     0
8   °--IT                           3.0     3
9       ¦--Outsource                0.0     0
10      ¦--Go agile                 0.0     0
11      °--Switch to R              0.0     0

The implementation of ?averageBranchingFactor does not hold any secrets, so you can tweak it to your needs. Simply type averageBranchingFactor into your console (without parentheses):

function (node) 
{
    t <- Traverse(node, filterFun = isNotLeaf)
    if (length(t) == 0) 
        return(0)
    cnt <- Get(t, "count")
    if (!is.numeric(cnt)) 
        browser()
    return(mean(cnt))
}

In short, we traverse the tree (except leaves), and get the count value for each node. Finally, we calculate the mean.
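To see which individual values go into that mean, here is a minimal base R sketch that does not use data.tree at all. The `children` list is a hypothetical stand-in for the acme tree, written out by hand; it is not how the package stores nodes.

```r
# Hypothetical stand-in for the acme tree: each parent maps to its direct children.
children <- list(
  "Acme Inc."  = c("Accounting", "Research", "IT"),
  "Accounting" = c("New Software", "New Accounting Standards"),
  "Research"   = c("New Product Line", "New Labs"),
  "IT"         = c("Outsource", "Go agile", "Switch to R")
)

# Every node: the parents plus everything that appears as a child.
nodes <- unique(c(names(children), unlist(children, use.names = FALSE)))

# Branching factor = number of direct children (0 for leaves, which have no entry).
branching <- vapply(nodes, function(n) length(children[[n]]), integer(1))

# Keep only non-leaf nodes, then average their counts.
non_leaf <- branching[branching > 0]
non_leaf
mean(non_leaf)
```

The non-leaf counts are 3, 2, 2 and 3, whose mean is 2.5. Averaging the per-node averageBranchingFactor column instead uses 2.5, 2, 2 and 3 (the root's own value is already an average), which is where 2.375 comes from.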

Hope that helps.

Upvotes: 0

alistaire

Reputation: 43354

You can adapt my answer to your other question, substituting list.dirs with recursive = FALSE for list.files:

library(purrr)

files <- .libPaths()[1] %>%    # omit for current directory or supply alternate path
    list.dirs() %>% 
    map_df(~list(path = .x, 
                 dirs = length(list.dirs(.x, recursive = FALSE))))

files
#> # A tibble: 4,457 x 2
#>                                                                           path  dirs
#>                                                                          <chr> <int>
#>  1              /Library/Frameworks/R.framework/Versions/3.4/Resources/library   314
#>  2        /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind     4
#>  3   /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind/help     0
#>  4   /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind/html     0
#>  5   /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind/Meta     0
#>  6      /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind/R     0
#>  7      /Library/Frameworks/R.framework/Versions/3.4/Resources/library/acepack     5
#>  8 /Library/Frameworks/R.framework/Versions/3.4/Resources/library/acepack/help     0
#>  9 /Library/Frameworks/R.framework/Versions/3.4/Resources/library/acepack/html     0
#> 10 /Library/Frameworks/R.framework/Versions/3.4/Resources/library/acepack/libs     1
#> # ... with 4,447 more rows

mean(files$dirs[files$dirs != 0])
#> [1] 2.952949

or in base R,

files <- do.call(rbind, lapply(list.dirs(.libPaths()[1]), function(path){
    data.frame(path = path, 
               dirs = length(list.dirs(path, recursive = FALSE)), 
               stringsAsFactors = FALSE)
}))

head(files)
#>                                                                        path dirs
#> 1            /Library/Frameworks/R.framework/Versions/3.4/Resources/library  314
#> 2      /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind    4
#> 3 /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind/help    0
#> 4 /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind/html    0
#> 5 /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind/Meta    0
#> 6    /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind/R    0

mean(files$dirs[files$dirs != 0])
#> [1] 2.952949

Upvotes: 0

Gilles San Martin

Reputation: 4370

You can simply loop over the folder structure and count the number of folders (non-recursively) at each level:

dir.create("top_level_folder/sub_folder_2/sub_folder_3", recursive = TRUE)
dir.create("top_level_folder/sub_folder_1")


dirs <- list.dirs()
branching_factor <- integer(length(dirs))   # one count per folder
for (i in seq_along(dirs)) {
    branching_factor[i] <- length(list.dirs(path = dirs[i], 
                                            full.names = FALSE, recursive = FALSE))
}

result <- data.frame(Folders = basename(dirs), BranchingFactor = branching_factor)
result[-1,]

You could also use a shorter, more idiomatic version of this code:

dirs <- list.dirs()
branching_factor <- sapply(dirs, function(x) length(list.dirs(x, FALSE, FALSE)))
result2 <- data.frame(Folders = basename(dirs), BranchingFactor = branching_factor, 
                      row.names = NULL)[-1,]

The result looks like this:

> head(result2[rev(order(result2[,2])),])
          Folders BranchingFactor
208      fixtures              24
122      fixtures              23
42       fixtures              18
440      core_ext              17
340 active_record              17
562         rails              16

Upvotes: 2

parth

Reputation: 1631

A small correction to @Gilles's solution:

path <- "SO/rails-master/"
dirs <- list.dirs(path)
branching_factor <- integer(length(dirs))   # one count per folder
for (i in seq_along(dirs)) {
   branching_factor[i] <- length(list.dirs(path = dirs[i], recursive = FALSE))
}

result <- data.frame(Folders = basename(dirs), BranchingFactor = branching_factor)

> head(result)
       Folders BranchingFactor
1 rails-master              14
2      .github               0
3  actioncable               4
4          app               1
5       assets               1
6  javascripts               1

Hope this helps.

Upvotes: 1

moodymudskipper

Reputation: 47330

I list all folders recursively, then build a table of (parent folder, folder) pairs. From these I can count the number of subfolders per folder.

That misses folders that have no subfolders, though, so I merge the counts back onto the initial folders with a left join and fill in the NAs with zeroes.

library(magrittr)  # provides %>%

path <- getwd()
all_folders <- path %>% list.dirs(full.names=TRUE,recursive=TRUE) %>% 
  data.frame(stringsAsFactors=FALSE) %>% setNames("Folders")
all_sub_folders <- all_folders$Folders %>%
  strsplit("/") %>%
  lapply(function(x){c(x[length(x)-1],x[length(x)])}) %>%
  do.call(rbind,.) %>%
  as.data.frame(stringsAsFactors=FALSE) %>%
  setNames(c("ParentFolders","Folders"))
output <- all_sub_folders$ParentFolders %>% table %>% as.data.frame(stringsAsFactors=FALSE) %>% setNames(c("Folders","SubFolders"))
output <- merge(all_sub_folders,output,all.x = TRUE)[,c("Folders","SubFolders")]
output$SubFolders[is.na(output$SubFolders)] <- 0
output <- output[match(all_sub_folders$Folders,output$Folders),]

head(output)
#      Folders SubFolders
# 2160   Rhome        126
# 17   acepack          5
# 856     help          1
# 992     html          9
# 1486    libs        124
# 1130    i386          0
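The same parent-counting idea can be sketched in base R on full paths, which sidesteps joining on bare folder names that may repeat (several `help` folders, for instance). This is an illustrative variant rather than the code above; the temporary toy structure is an assumption taken from the question's example.

```r
# Rebuild the question's toy structure in a temporary location.
root <- file.path(tempfile("bf_"), "top_level_folder")
dir.create(file.path(root, "sub_folder_2", "sub_folder_3"), recursive = TRUE)
dir.create(file.path(root, "sub_folder_1"))
root <- normalizePath(root, winslash = "/")  # consistent separators on every platform

# One recursive listing; the root itself comes back as part of the result.
dirs <- list.dirs(root, full.names = TRUE, recursive = TRUE)

# Position of each folder's parent within `dirs`
# (NA for the root, whose parent lies outside the listing).
parent_idx <- match(dirname(dirs), dirs)

# Count how often each folder occurs as a parent; folders that never do get 0.
sub_folders <- tabulate(parent_idx[!is.na(parent_idx)], nbins = length(dirs))

output <- data.frame(Folders = basename(dirs), SubFolders = sub_folders,
                     stringsAsFactors = FALSE)
output
```

Because `tabulate` returns a zero for every position that is never matched, no separate join or NA replacement is needed, and the file system is listed only once.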

Upvotes: 0
