CodeDBA

Reputation: 3

How to count the number of S3 folders inside a given path?

I searched for a solution to this but wasn't lucky, so I'm hoping to find one quickly here. I have some migrated files in S3, and there is now a requirement to identify the number of folders under a given path. Say I have some files as below.

If I run aws s3 ls s3://my-bucket/foo1 --recursive >> file_op.txt

then "cat file_op.txt" will look something like below:

my-bucket/foo1/foo2/foo3/foo4/foo5/foo6/foo7/file1.txt
my-bucket/foo1/foo2/foo3/foo4/foo5/foo6/foo7/file2.txt
my-bucket/foo1/foo2/foo3/foo4/foo5/foo6/file1.pdf
my-bucket/foo1/foo2/foo3/foo4/foo6/file2.txt
my-bucket/foo1/foo2/foo3/file3.txt
my-bucket/foo1/foo8/file1.txt
my-bucket/foo1/foo9/foo10/file4.csv

I stored the output in a file and found the number of files with wc -l, but I couldn't find the number of folders involved in the path.

I need the output as below:

number of files : 7
number of folders : 9

EDIT 1: Corrected the expected number of folders.

(Excluding my-bucket and foo1)

(foo6 is in foo5 and foo4 directories)

Below is my code where I'm failing in calculating the count of directories:

#!/bin/bash
if [[ "$#" -ne 1 ]] ; then
    echo "Usage: $0 \"s3 folder path\" <eg. \"my-bucket/foo1\"> "
    exit 1
else
    start=$SECONDS
    input=$1
    input_code=$(echo $input | awk -F'/' '{print $1 "_" $3}')
    #input_length=$(echo $input | awk -F'/' '{print NF}' )
    s3bucket=$(echo $input | awk -F'/' '{print $1}')
    db_name=$(echo $input | awk -F'/' '{print $3}')
    pathfinder=$(echo $input | awk 'BEGIN{FS=OFS="/"} {first = $1; $1=""; print}'|sed 's#^/##g'|sed 's#$#/#g')
    myn=$(whoami)
    cdt=$(date +%Y%m%d%H%M%S)
    filename=$0_${myn}_${cdt}_${input_code}
    folders=${filename}_folders
    dcountfile=${filename}_dir_cnt
    aws s3 ls s3://${input} --recursive | awk '{print $4}' > $filename
    cat $filename |awk -F"$pathfinder" '{print $2}'| awk 'BEGIN{FS=OFS="/"}{NF--; print}'| sort -n | uniq > $folders
    #grep -oP '(?<="$input_code" ).*'
    fcount=`cat ${filename} | wc -l`
    awk 'BEGIN{FS="/"}
    {   if (NF > maxNF)
             {
                 for (i = maxNF + 1; i <= NF; i++)
                     count[i] = 1;
                 maxNF = NF;
             }
             for (i = 1; i <= NF; i++)
             {
                 if (col[i] != "" && $i != col[i])
                    count[i]++;
                 col[i] = $i;
             }
         }
         END {
             for (i = 1; i <= maxNF; i++)
                 print count[i];
    }'  $folders > $dcountfile
    dcount=$(cat $dcountfile | xargs | awk '{for(i=t=0;i<NF;) t+=$++i; $0=t}1' )
    printf "Bucket name : \e[1;31m $s3bucket \e[0m\n" | tee -a  ${filename}.out
    printf "DB name : \e[1;31m $db_name \e[0m\n" | tee -a  ${filename}.out
    printf "Given folder path : \e[1;31m $input \e[0m\n" | tee -a  ${filename}.out
    printf "The number of folders in the given directory are\e[1;31m $dcount \e[0m\n" | tee -a ${filename}.out
    printf "The number of files in the given directory are\e[1;31m $fcount \e[0m\n" | tee -a ${filename}.out
    end=$SECONDS
    elapsed=$((end - start))
    printf '\n*** Script completed in %d:%02d:%02d - Elapsed %d:%02d:%02d ***\n' \
           $((end / 3600)) $((end / 60 % 60)) $((end % 60)) \
           $((elapsed / 3600)) $((elapsed / 60 % 60)) $((elapsed % 60)) | tee -a ${filename}.out
    exit 0
fi

Upvotes: 0

Views: 2779

Answers (3)

RARE Kpop Manifesto

Reputation: 2809

If you don't mind using a pipe and calling awk twice, then it's rather clean:

 mawk 'BEGIN {OFS=ORS;FS="/";_^=_}_+_<NF && --NF~($_="")' file \
 | mawk 'NF {_[$__]} END { print length(_) }'
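
For readers who find the golfed form hard to follow, here is a minimal sketch of what the pipeline appears to do, written as a single plain awk call: on every line with more than two `/`-separated fields it drops the first field and the file name, then counts the distinct non-empty names that remain. The function name `count_names_below_bucket` is made up for illustration, and this is one reading of the one-liner, not something the answer itself states.

```shell
#!/bin/sh
# Hypothetical, readable equivalent of the two-stage mawk pipeline above.
count_names_below_bucket() {
  awk -F'/' '
    NF > 2 {
      for (i = 2; i < NF; i++)          # skip field 1 and the file name
        if ($i != "" && !($i in seen)) { seen[$i]; n++ }
    }
    END { print n + 0 }' "$@"
}

# Two keys shaped like the question's sample; foo1, foo2, foo3, foo8 are counted.
count_names_below_bucket <<'EOF'
my-bucket/foo1/foo2/foo3/file3.txt
my-bucket/foo1/foo8/file1.txt
EOF
```

Under this reading, on input shaped like the question's sample the second field (foo1) is still counted, so the result is one higher than the question's expected 9 unless the loop starts at field 3.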

Upvotes: 0

ikegami

Reputation: 385496

You have clarified that you wanted to count the unique names, ignoring the top two levels (my-bucket and foo1) and the last level (the file name).

perl -F/ -lane'
   ++$f;
   ++$d{ $F[$_] } for 2 .. $#F - 1;
   END {
      print "Number of files: ".( $f // 0 );
      print "Number of dirs: ".( keys(%d) // 0 );
   }
'

Output:

Number of files: 7
Number of dirs: 9

See also: Specifying file to process to Perl one-liner

Upvotes: 0

Dudi Boy

Reputation: 4865

Your question is not clear.

If we count the unique relative folder paths in the list provided, there are 12:

my-bucket/foo1/foo2/foo3/foo4/foo5/foo6/foo7
my-bucket/foo1/foo2/foo3/foo4/foo5/foo6
my-bucket/foo1/foo2/foo3/foo4/foo6
my-bucket/foo1/foo2/foo3/foo4/foo5
my-bucket/foo1/foo2/foo3/foo4
my-bucket/foo1/foo2/foo3
my-bucket/foo1/foo2
my-bucket/foo1/foo8
my-bucket/foo1/foo9/foo10
my-bucket/foo1/foo9
my-bucket/foo1
my-bucket

The awk script to count this is:

BEGIN {FS = "/";} # set field separator to "/"
{  # for each input line
  cumulativePath = OFS = ""; # reset cumulativePath and OFS (Output Field Separator) to ""
  for (i = 1; i < NF; i++) { # loop over all folders, stopping before the file name
    if (i > 1) OFS = FS; # set OFS to "/" from the second path component onward
    cumulativePath = cumulativePath OFS $i;  # append current field to cumulativePath
    dirs[cumulativePath] = 0; # insert cumulativePath into the associative array dirs
  }
}
END {
  print NR " " length(dirs); # print record count and the length of the associative array dirs
}
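
The question excludes my-bucket and foo1 from the count. The following is a minimal sketch of the same cumulative-path idea that skips a configurable number of leading components. The function name `count_paths` and its `skip` argument are illustrative and not part of the script above. It counts entries as they are inserted instead of calling `length(dirs)`, on the assumption that `length` applied to an array may not be available in every awk.

```shell
#!/bin/sh
# count_paths SKIP [FILE...]: print "<files> <unique folder paths>",
# ignoring the first SKIP path components (hypothetical helper).
count_paths() {
  skip=$1; shift
  awk -F'/' -v skip="$skip" '
    NF == 0 { next }                      # ignore blank lines
    {
      files++
      path = ""
      for (i = skip + 1; i < NF; i++) {   # stop before the file name
        if (path == "") path = $i
        else path = path "/" $i
        if (!(path in dirs)) { dirs[path]; n++ }
      }
    }
    END { print files + 0, n + 0 }' "$@"
}

# Three keys from the question's sample; reads stdin when no file is given.
count_paths 2 <<'EOF'
my-bucket/foo1/foo2/foo3/file3.txt
my-bucket/foo1/foo8/file1.txt
my-bucket/foo1/foo9/foo10/file4.csv
EOF
```

With `skip` set to 0 this counts the same paths as the script above. With `skip` set to 2 the two prefix entries (my-bucket and my-bucket/foo1) drop out, so the full sample gives 10 rather than 12.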

If we count unique folder names, there are 11:

my-bucket
foo1
foo2
foo3
foo4
foo5
foo6
foo7
foo8
foo9
foo10

The awk script to count this is:

awk -F'/' '{for(i=1;i<NF;i++)dirs[$i]=1;}END{print NR " " length(dirs)}' input.txt
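
To reproduce the output the question asks for (unique folder names, excluding my-bucket and foo1), the loop can start at the third field. This is a sketch under that assumption; the function name `count_names` and its `skip` argument are hypothetical, and the input is assumed to have one key per line in the shape shown in the question.

```shell
#!/bin/sh
# count_names SKIP [FILE...]: hypothetical helper that reports the file count
# and the number of distinct folder names, ignoring the first SKIP components.
count_names() {
  skip=$1; shift
  awk -F'/' -v skip="$skip" '
    NF == 0 { next }                      # ignore blank lines
    {
      files++
      for (i = skip + 1; i < NF; i++)     # stop before the file name
        if (!($i in names)) { names[$i]; n++ }
    }
    END {
      printf "number of files : %d\n", files
      printf "number of folders : %d\n", n
    }' "$@"
}

# The sample listing from the question, fed on stdin.
count_names 2 <<'EOF'
my-bucket/foo1/foo2/foo3/foo4/foo5/foo6/foo7/file1.txt
my-bucket/foo1/foo2/foo3/foo4/foo5/foo6/foo7/file2.txt
my-bucket/foo1/foo2/foo3/foo4/foo5/foo6/file1.pdf
my-bucket/foo1/foo2/foo3/foo4/foo6/file2.txt
my-bucket/foo1/foo2/foo3/file3.txt
my-bucket/foo1/foo8/file1.txt
my-bucket/foo1/foo9/foo10/file4.csv
EOF
```

Because names are compared without their parent path, foo6 is counted once even though it appears under both foo4 and foo5, which is why the sample yields 9 folders and 7 files.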

Upvotes: 1
