akhil regonda

Reputation: 159

Fastest way to get the files count and total size of a folder in GCS?

Assume there is a bucket with a folder root, which has subfolders and files. Is there any way to get the total file count and total size of the root folder?

What I tried: With gsutil du I get the size quickly, but not the count. With gsutil ls ___ I get the list and sizes; if I pipe it into awk and sum them I get the expected result, but ls itself takes a lot of time.

So is there a better/faster way to handle this?

Upvotes: 4

Views: 8208

Answers (3)

Axel Borja

Reputation: 3974

  1. Using gsutil du -sh, which could be a good idea for small directories.

For big directories, I am not able to get a result, even after a few hours, only a repeated retrying message.

  2. Using gsutil ls, which is more efficient.

For big directories it can take tens of minutes, but at least it completes.

To retrieve the number of files and the total size of a directory with gsutil ls, you can use the following command:

gsutil ls -l gs://bucket/dir/** | awk '$1 ~ /^[0-9]+$/ {size+=$1; count++} END {print "nb_files:", count, "\ntotal_size:", size, "B"}'

(the $1 filter skips the trailing TOTAL summary line that gsutil ls -l prints, so it isn't counted as a file)

Then divide the value by one of the following (a small conversion helper is sketched after this list):

  • 1024 for KB
  • 1024 * 1024 for MB
  • 1024 * 1024 * 1024 for GB
  • ...
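
For example, here is a tiny Python helper (a minimal sketch; the function name is mine, and it simply applies the 1024-based divisors above):

def human_readable(num_bytes):
    # Repeatedly divide by 1024 to go from B to KB, MB, GB, ...
    for unit in ("B", "KB", "MB", "GB", "TB", "PB"):
        if num_bytes < 1024 or unit == "PB":
            return "%.1f %s" % (num_bytes, unit)
        num_bytes /= 1024

print(human_readable(5 * 1024**3))  # "5.0 GB"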


Upvotes: 0

mhouglum

Reputation: 2593

Doing an object listing of some sort is the way to go - both the ls and du commands in gsutil perform object listing API calls under the hood.

If you want to get a summary of all objects in a bucket, check Cloud Monitoring (as mentioned in the docs). But this isn't applicable if you want statistics for a subset of objects: GCS doesn't support actual "folders", so all your objects under the "folder" foo are really just objects whose names share a common prefix, foo/.

If you want to analyze the number of objects under a given prefix, you'll need to perform object listing API calls (either using a client library or using gsutil). The listing operations can only return so many objects per response and thus are paginated, meaning you'll have to make several calls if you have lots of objects under the desired prefix. The max number of results per listing call is currently 1,000. So as an example, if you had 200,000 objects to list, you'd have to make 200 sequential API calls.
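
To make the pagination concrete, here is a rough sketch using the Python client library (the bucket name, prefix, and page size are placeholders, not anything from the answer); each page yielded by the iterator corresponds to one listing API call:

from google.cloud import storage

client = storage.Client()
# "my-bucket" and "some-prefix/" are placeholders for your bucket and prefix.
blobs = client.list_blobs("my-bucket", prefix="some-prefix/", page_size=1000)

listing_calls = 0
object_count = 0
for page in blobs.pages:  # each page is the result of one object-listing request
    listing_calls += 1
    object_count += page.num_items

print("objects:", object_count, "listing calls:", listing_calls)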

A note on gsutil's ls:

There are several scenarios in which gsutil can do "extra" work when completing an ls command, such as a "long" listing with the -L flag or a recursive listing with the -r flag. To save time and make the fewest listing calls possible while obtaining the total count of bytes under some prefix, you'll want to do a "flat" listing using gsutil's wildcard support, e.g.:

gsutil ls -l gs://my-bucket/some-prefix/**

Alternatively, you could try writing a script using one of the GCS client libraries, like the Python library and its list_blobs functionality.
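
For instance, a minimal sketch along those lines with the Python library (bucket name and prefix are placeholders; pagination is handled transparently by the returned iterator):

from google.cloud import storage

client = storage.Client()
count = 0
total_bytes = 0
# "my-bucket" and "some-prefix/" are placeholders.
for blob in client.list_blobs("my-bucket", prefix="some-prefix/"):
    count += 1
    total_bytes += blob.size or 0  # blob.size is the object's size in bytes

print("nb_files:", count)
print("total_size:", total_bytes, "B")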

Upvotes: 3

Brandon Yarbrough

Reputation: 38389

If you want to track the count of objects in a bucket over a long time, Cloud Monitoring offers the metric "storage/object_count". The metric updates about once per day, which makes it more useful for long-term trends.
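
If you go that route, the metric can be read programmatically; the following is a hedged sketch with the Python monitoring_v3 client (project ID, bucket name, and time window are placeholders, and the filter should be double-checked against the Monitoring docs):

import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project"  # placeholder project ID

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 48 * 3600}}
)

# Query the GCS object_count metric for a single bucket ("my-bucket" is a placeholder).
results = client.list_time_series(
    request={
        "name": project_name,
        "filter": (
            'metric.type="storage.googleapis.com/storage/object_count" '
            'AND resource.labels.bucket_name="my-bucket"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    for point in series.points:
        print(point.value.int64_value)  # one point per sampling interval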

As for counting instantaneously, unfortunately gsutil ls is probably your best bet.

Upvotes: 2
