Reputation: 1459
What is the most efficient way to get the count on the number of blobs in an Azure Storage container?
Right now I can't think of any way other than the code below:
CloudBlobContainer container = GetContainer("mycontainer");
var count = container.ListBlobs().Count();
Upvotes: 39
Views: 66028
Reputation: 8824
This answer is for someone with large blob storage with millions of blobs.
The top-rated answer on this thread is pretty much unusable with large blob storages. The azure storage explorer application simply calls list blobs API under the hood which is paginated and allows 5000 records at a time. In case you have millions of blobs, this will take forever to return the blob count.
If you are ok with approximate value, then the storage browser option in azure portal is extremely useful. However, note that this value is not very accurate on blob storages that have high write/delete operations.
Also, this data should be visible by default. If not, enable the diagnostics metrics. Monitoring -> Diagnostic Settings(classic). (Turn the status on and enable the hour metrics)
If you want more accurate results, then the only option is to enable blob storage inventory report. The downside is that this is a background job, and the report can be generated only once per day. Here is the document on the same. For large blob storage's, my suggestion is to generate a parquet report every day and when you need to inspect/read the report, either use Dbeaver(along with DuckDB) or Databricks or Synapse. Below listed few resources on how this can be achieved.
If you do not wish to use inventory report, here is a PowerShell script to achieve something similar. However, this can take many hours to return blob count on large blob storages.
Upvotes: 1
Reputation: 934
With azure-cli
it would be as follow:
az storage blob list --account-name <name> --container-name <name> --num-results "*" --query "length(@)"
Upvotes: 1
Reputation: 2773
Bearing in mind all the performance concerns from the other answers, here is a version for v12 of the Azure SDK leveraging IAsyncEnumerable
. This requires a package reference to System.Linq.Async.
public async Task<int> GetBlobCount()
{
var container = await GetBlobContainerClient();
var blobsPaged = container.GetBlobsAsync();
return await blobsPaged
.AsAsyncEnumerable()
.CountAsync();
}
Upvotes: 2
Reputation: 1
You can use this
public static async Task<List<IListBlobItem>> ListBlobsAsync()
{
BlobContinuationToken continuationToken = null;
List<IListBlobItem> results = new List<IListBlobItem>();
do
{
CloudBlobContainer container = GetContainer("containerName");
var response = await container.ListBlobsSegmentedAsync(null,
true, BlobListingDetails.None, 5000, continuationToken, null, null);
continuationToken = response.ContinuationToken;
results.AddRange(response.Results);
} while (continuationToken != null);
return results;
}
and then call
var count = await ListBlobsAsync().Count;
hope it will be useful
Upvotes: 0
Reputation: 2654
List blobs approach is accurate but slow if you have millions of blobs. Another way that works in a few cases but is relatively fast is querying the MetricsHourPrimaryTransactionsBlob
table. It is at the account level and metrics get aggregated hourly.
https://learn.microsoft.com/en-us/azure/storage/common/storage-analytics-metrics
Upvotes: 0
Reputation: 345
If you are using Azure.Storage.Blobs library, you can use something like below:
public int GetBlobCount(string containerName)
{
int count = 0;
BlobContainerClient container = new BlobContainerClient(blobConnctionString, containerName);
container.GetBlobs().ToList().ForEach(blob => count++);
return count;
}
Upvotes: 1
Reputation: 4530
Count all blobs in a classic and new blob storage account. Building on @gandikota-saikoushik, this solution works for blob containers with a very large number of blobs.
//setup set values from Azure Portal
var accountName = "<ACCOUNTNAME>";
var accountKey = "<ACCOUTNKEY>";
var containerName = "<CONTAINTERNAME>";
uristr = $"DefaultEndpointsProtocol=https;AccountName={accountName};AccountKey={accountKey}";
var storageAccount = Microsoft.WindowsAzure.Storage.CloudStorageAccount.Parse(uristr);
var client = storageAccount.CreateCloudBlobClient();
var container = client.GetContainerReference(containerName);
BlobContinuationToken continuationToken = new BlobContinuationToken();
blobcount = CountBlobs(container, continuationToken).ConfigureAwait(false).GetAwaiter().GetResult();
Console.WriteLine($"blobcount:{blobcount}");
public static async Task<int> CountBlobs(CloudBlobContainer container, BlobContinuationToken currentToken)
{
BlobContinuationToken continuationToken = null;
var result = 0;
do
{
var response = await container.ListBlobsSegmentedAsync(continuationToken);
continuationToken = response.ContinuationToken;
result += response.Results.Count();
}
while (continuationToken != null);
return result;
}
Upvotes: 0
Reputation: 103
I have spend quite period of time to find the below solution - I don't want to some one like me to waste time - so replying here even after 9 years
package com.sai.koushik.gandikota.test.app;
import com.microsoft.azure.storage.CloudStorageAccount;
import com.microsoft.azure.storage.blob.*;
public class AzureBlobStorageUtils {
public static void main(String[] args) throws Exception {
AzureBlobStorageUtils getCount = new AzureBlobStorageUtils();
String storageConn = "<StorageAccountConnection>";
String blobContainerName = "<containerName>";
String subContainer = "<subContainerName>";
Integer fileContainerCount = getCount.getFileCountInSpecificBlobContainersSubContainer(storageConn,blobContainerName, subContainer);
System.out.println(fileContainerCount);
}
public Integer getFileCountInSpecificBlobContainersSubContainer(String storageConn, String blobContainerName, String subContainer) throws Exception {
try {
CloudStorageAccount storageAccount = CloudStorageAccount.parse(storageConn);
CloudBlobClient blobClient = storageAccount.createCloudBlobClient();
CloudBlobContainer blobContainer = blobClient.getContainerReference(blobContainerName);
return ((CloudBlobDirectory) blobContainer.listBlobsSegmented().getResults().stream().filter(listBlobItem -> listBlobItem.getUri().toString().contains(subContainer)).findFirst().get()).listBlobsSegmented().getResults().size();
} catch (Exception e) {
throw new Exception(e.getMessage());
}
}
}
Upvotes: 0
Reputation: 41
Another Python example, works slow but correctly with >5000 files:
from azure.storage.blob import BlobServiceClient
constr="Connection string"
container="Container name"
blob_service_client = BlobServiceClient.from_connection_string(constr)
container_client = blob_service_client.get_container_client(container)
blobs_list = container_client.list_blobs()
num = 0
size = 0
for blob in blobs_list:
num += 1
size += blob.size
print(blob.name,blob.size)
print("Count: ", num)
print("Size: ", size)
Upvotes: 0
Reputation: 721
If you just want to know how many blobs are in a container without writing code you can use the Microsoft Azure Storage Explorer application.
Upvotes: 48
Reputation: 5620
I tried counting blobs using ListBlobs() and for a container with about 400,000 items, it took me well over 5 minutes.
If you have complete control over the container (that is, you control when writes occur), you could cache the size information in the container metadata and update it every time an item gets removed or inserted. Here is a piece of code that would return the container blob count:
static int CountBlobs(string storageAccount, string containerId)
{
CloudStorageAccount cloudStorageAccount = CloudStorageAccount.Parse(storageAccount);
CloudBlobClient blobClient = cloudStorageAccount.CreateCloudBlobClient();
CloudBlobContainer cloudBlobContainer = blobClient.GetContainerReference(containerId);
cloudBlobContainer.FetchAttributes();
string count = cloudBlobContainer.Metadata["ItemCount"];
string countUpdateTime = cloudBlobContainer.Metadata["CountUpdateTime"];
bool recountNeeded = false;
if (String.IsNullOrEmpty(count) || String.IsNullOrEmpty(countUpdateTime))
{
recountNeeded = true;
}
else
{
DateTime dateTime = new DateTime(long.Parse(countUpdateTime));
// Are we close to the last modified time?
if (Math.Abs(dateTime.Subtract(cloudBlobContainer.Properties.LastModifiedUtc).TotalSeconds) > 5) {
recountNeeded = true;
}
}
int blobCount;
if (recountNeeded)
{
blobCount = 0;
BlobRequestOptions options = new BlobRequestOptions();
options.BlobListingDetails = BlobListingDetails.Metadata;
foreach (IListBlobItem item in cloudBlobContainer.ListBlobs(options))
{
blobCount++;
}
cloudBlobContainer.Metadata.Set("ItemCount", blobCount.ToString());
cloudBlobContainer.Metadata.Set("CountUpdateTime", DateTime.Now.Ticks.ToString());
cloudBlobContainer.SetMetadata();
}
else
{
blobCount = int.Parse(count);
}
return blobCount;
}
This, of course, assumes that you update ItemCount/CountUpdateTime every time the container is modified. CountUpdateTime is a heuristic safeguard (if the container did get modified without someone updating CountUpdateTime, this will force a re-count) but it's not reliable.
Upvotes: 16
Reputation: 819
If you are not using virtual directories, the following will work as previously answered.
CloudBlobContainer container = GetContainer("mycontainer");
var count = container.ListBlobs().Count();
However, the above code snippet may not have the desired count if you are using virtual directories.
For instance, if your blobs are stored similar to the following: /container/directory/filename.txt where the blob name = directory/filename.txt the container.ListBlobs().Count(); will only count how many "/directory" virtual directories you have. If you want to list blobs contained within virtual directories, you need to set the useFlatBlobListing = true in the ListBlobs() call.
CloudBlobContainer container = GetContainer("mycontainer");
var count = container.ListBlobs(null, true).Count();
Note: the ListBlobs() call with useFlatBlobListing = true is a much more expensive/slow call...
Upvotes: 2
Reputation: 2399
With Python API of Azure Storage it is like:
from azure.storage import *
blob_service = BlobService(account_name='myaccount', account_key='mykey')
blobs = blob_service.list_blobs('mycontainer')
len(blobs) #returns the number of blob in a container
Upvotes: 1
Reputation: 31
Example using PHP API and getNextMarker.
Counts total number of blobs in an Azure container. It takes a long time: about 30 seconds for 100000 blobs.
(assumes we have a valid $connectionString and a $container_name)
$blobRestProxy = ServicesBuilder::getInstance()->createBlobService($connectionString);
$opts = new ListBlobsOptions();
$nblobs = 0;
while($cont) {
$blob_list = $blobRestProxy->listBlobs($container_name, $opts);
$nblobs += count($blob_list->getBlobs());
$nextMarker = $blob_list->getNextMarker();
if (!$nextMarker || strlen($nextMarker) == 0) $cont = false;
else $opts->setMarker($nextMarker);
}
echo $nblobs;
Upvotes: 3
Reputation: 71121
The API doesn't contain a container count method or property, so you'd need to do something like what you posted. However, you'll need to deal with NextMarker if you exceed 5,000 items returned (or if you specify max # to return and the list exceeds that number). Then you'll make add'l calls based on NextMarker and add the counts.
EDIT: Per smarx: the SDK should take care of NextMarker for you. You'll need to deal with NextMarker if you're working at the API level, calling List Blobs through REST.
Alternatively, if you're controlling the blob insertions/deletions (through a wcf service, for example), you can use the blob container's metadata area to store a cached container count that you compute with each insert or delete. You'll just need to deal with write concurrency to the container.
Upvotes: 11