codeMonkey

Reputation: 4845

Azure .NET SDK - Query paths using DataLakeFileSystemClient using Regex

I want to use DataLakeFileSystemClient to query paths by Regex. Unfortunately the only way I've been able to figure it out so far is to traverse every single path with a prefix, then use Regex after the fact to check if the item matches. Is there a better way to do this?

private static async IAsyncEnumerable<string> TraverseDirectories(DataLakeFileSystemClient fileSystemClient,
    string directoryPath, string filePattern, [EnumeratorCancellation] CancellationToken cancellationToken)
{
    cancellationToken.ThrowIfCancellationRequested();
    // Recursively list all paths (files and directories) under directoryPath
    await foreach (PathItem pathItem in fileSystemClient.GetPathsAsync(directoryPath, recursive: true, cancellationToken: cancellationToken))
    {
        cancellationToken.ThrowIfCancellationRequested();
        if (pathItem.IsDirectory.HasValue && pathItem.IsDirectory.Value)
            continue;

        // Keep only files whose full path matches the regex pattern
        if (Regex.IsMatch(pathItem.Name, filePattern))
        {
            yield return pathItem.Name;
        }
    }
}

Upvotes: 0

Views: 85

Answers (1)

Venkatesan

Reputation: 10455


According to the Microsoft documentation, the DataLakeFileSystemClient API does not currently support querying paths with regex patterns directly. Instead, it lists paths filtered by parameters such as a path prefix.

However, the approach of traversing the directory and filtering results client-side using regex is a valid workaround.

Code:

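using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Threading;
using System.Threading.Tasks;
using Azure.Identity;
using Azure.Storage.Files.DataLake;
using Azure.Storage.Files.DataLake.Models;
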
namespace DataLakeRegexExample
{
    public class DataLakeFileSearcher
    {
        private readonly DataLakeFileSystemClient _fileSystemClient;

        public DataLakeFileSearcher(DataLakeFileSystemClient fileSystemClient)
        {
            _fileSystemClient = fileSystemClient;
        }

        public async IAsyncEnumerable<string> TraverseDirectoriesAsync(
            string directoryPath,
            string filePattern,
            [System.Runtime.CompilerServices.EnumeratorCancellation] CancellationToken cancellationToken = default)
        {
            cancellationToken.ThrowIfCancellationRequested();

            // Recursively list all paths (files and directories) under directoryPath
            await foreach (PathItem pathItem in _fileSystemClient.GetPathsAsync(directoryPath, recursive: true, cancellationToken: cancellationToken))
            {
                cancellationToken.ThrowIfCancellationRequested();
                if (pathItem.IsDirectory.HasValue && pathItem.IsDirectory.Value)
                    continue;
                if (Regex.IsMatch(pathItem.Name, filePattern))
                {
                    yield return pathItem.Name; // Yield the full path of each matching file
                }
            }
        }

        public static async Task Main(string[] args)
        {
            string accountName = "xxxx"; // Replace with your account name
            string fileSystemName = "xxx"; // Replace with your file system name
            string directoryPath = "xxxx"; // Replace with your directory path
            string filePattern = @"^.*\.csv$"; // Sample regex that matches only .csv files
            var serviceClient = new DataLakeServiceClient(
                new Uri($"https://{accountName}.dfs.core.windows.net"),
                new DefaultAzureCredential()
            );
            var fileSystemClient = serviceClient.GetFileSystemClient(fileSystemName);
            var searcher = new DataLakeFileSearcher(fileSystemClient);

            try
            {
                await foreach (var fileName in searcher.TraverseDirectoriesAsync(directoryPath, filePattern))
                {
                    Console.WriteLine(fileName);
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine($"Error occurred: {ex.Message}");
            }
        }
    }
}

The above code defines a DataLakeFileSearcher class that uses a DataLakeFileSystemClient to list every path under a directory recursively and match each file's full path (PathItem.Name) against a regex pattern. Since direct regex querying isn't supported, the filtering happens on the client side after the paths are listed.

Output:

backup/file1.csv
backup/large_file.csv
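
Because the filtering is done client-side, the pattern is evaluated once for every listed path. One possible refinement, shown here only as a sketch, is to build the Regex object once and reuse it for the whole listing. The PathFilter class, the WhereMatches method, and the in-memory SampleNames stream are hypothetical names introduced for illustration; they are not part of the Azure SDK. The sample stream stands in for the service listing so the snippet runs without a storage account.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Text.RegularExpressions;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not part of the Azure SDK).
public static class PathFilter
{
    // Filters any async stream of items by matching a name extracted from
    // each item against a regex that is constructed once up front.
    public static async IAsyncEnumerable<string> WhereMatches<T>(
        IAsyncEnumerable<T> items,
        Func<T, string> nameOf,
        string pattern,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        // Build the Regex a single time instead of passing the pattern
        // string to the static Regex.IsMatch on every iteration.
        var regex = new Regex(pattern, RegexOptions.Compiled | RegexOptions.CultureInvariant);

        await foreach (T item in items.WithCancellation(cancellationToken))
        {
            string name = nameOf(item);
            if (regex.IsMatch(name))
                yield return name;
        }
    }
}

public static class Program
{
    // In-memory stand-in for the paths a listing call would return.
    public static async IAsyncEnumerable<string> SampleNames()
    {
        yield return "backup/file1.csv";
        yield return "backup/notes.txt";
        yield return "backup/large_file.csv";
        await Task.CompletedTask;
    }

    public static async Task Main()
    {
        // Prints the two .csv paths and skips notes.txt.
        await foreach (string name in PathFilter.WhereMatches(SampleNames(), s => s, @"\.csv$"))
        {
            Console.WriteLine(name);
        }
    }
}
```

With the real client, the same helper could presumably be fed the listing directly, for example PathFilter.WhereMatches(fileSystemClient.GetPathsAsync(directoryPath, recursive: true), p => p.Name, filePattern), after skipping directories as in the code above. Every path under directoryPath is still listed, so the most effective way to reduce work remains passing the narrowest directory you can.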


Upvotes: 1
