I want to use DataLakeFileSystemClient to query paths by regex. Unfortunately, the only way I've found so far is to list every path under a prefix and then apply the regex afterwards to check whether each item matches. Is there a better way to do this?
private static async IAsyncEnumerable<string> TraverseDirectories(
    DataLakeFileSystemClient fileSystemClient,
    string directoryPath,
    string filePattern,
    [EnumeratorCancellation] CancellationToken cancellationToken)
{
    cancellationToken.ThrowIfCancellationRequested();

    // List all paths (files and directories) under directoryPath, recursively
    await foreach (PathItem pathItem in fileSystemClient.GetPathsAsync(directoryPath, recursive: true, cancellationToken: cancellationToken))
    {
        cancellationToken.ThrowIfCancellationRequested();

        // Skip directories; only files are matched
        if (pathItem.IsDirectory == true)
            continue;

        // Match the item's path against the regex pattern
        if (Regex.IsMatch(pathItem.Name, filePattern))
        {
            yield return pathItem.Name;
        }
    }
}
Azure .NET SDK - Query paths using DataLakeFileSystemClient using Regex.
According to this MS-Document, the DataLakeFileSystemClient API does not currently support querying paths with regex patterns directly. Instead, it lists paths filtered by parameters such as a path prefix.
Listing the directory and then filtering the results client-side with a regex is therefore a valid workaround.
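The client-side filtering step can be tried in isolation, without a storage account. The following is a minimal sketch, not part of the SDK: PathFilter and FilterByRegex are hypothetical names, and the sample paths in the usage note are illustrative only. It builds the Regex once up front rather than passing the pattern string on every call.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PathFilter
{
    // Hypothetical helper: keeps only the names that match the regex pattern.
    public static IEnumerable<string> FilterByRegex(IEnumerable<string> names, string pattern)
    {
        // Construct the Regex once so the pattern is parsed a single time.
        var regex = new Regex(pattern, RegexOptions.CultureInvariant);

        // Where is lazy: names are tested one by one as the caller enumerates.
        return names.Where(name => regex.IsMatch(name));
    }
}
```

For example, calling PathFilter.FilterByRegex with the names "backup/file1.csv", "backup/notes.txt" and "backup/large_file.csv" and the pattern @"\.csv$" keeps the two .csv entries and drops the .txt one.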
Code:
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Text.RegularExpressions;
using System.Threading;
using System.Threading.Tasks;
using Azure.Identity;
using Azure.Storage.Files.DataLake;
using Azure.Storage.Files.DataLake.Models;

namespace DataLakeRegexExample
{
    public class DataLakeFileSearcher
    {
        private readonly DataLakeFileSystemClient _fileSystemClient;

        public DataLakeFileSearcher(DataLakeFileSystemClient fileSystemClient)
        {
            _fileSystemClient = fileSystemClient;
        }

        public async IAsyncEnumerable<string> TraverseDirectoriesAsync(
            string directoryPath,
            string filePattern,
            [EnumeratorCancellation] CancellationToken cancellationToken = default)
        {
            cancellationToken.ThrowIfCancellationRequested();

            // List all paths (files and directories) under directoryPath, recursively
            await foreach (PathItem pathItem in _fileSystemClient.GetPathsAsync(directoryPath, recursive: true, cancellationToken: cancellationToken))
            {
                cancellationToken.ThrowIfCancellationRequested();

                // Skip directories; only files are matched
                if (pathItem.IsDirectory == true)
                    continue;

                if (Regex.IsMatch(pathItem.Name, filePattern))
                {
                    yield return pathItem.Name; // Yield the path of each matching file
                }
            }
        }

        public static async Task Main(string[] args)
        {
            string accountName = "xxxx"; // Replace with your account name
            string fileSystemName = "xxx"; // Replace with your file system name
            string directoryPath = "xxxx"; // Replace with your directory path
            string filePattern = @"^.*\.csv$"; // Sample regex pattern that matches only .csv files

            var serviceClient = new DataLakeServiceClient(
                new Uri($"https://{accountName}.dfs.core.windows.net"),
                new DefaultAzureCredential()
            );
            var fileSystemClient = serviceClient.GetFileSystemClient(fileSystemName);
            var searcher = new DataLakeFileSearcher(fileSystemClient);

            try
            {
                await foreach (var fileName in searcher.TraverseDirectoriesAsync(directoryPath, filePattern))
                {
                    Console.WriteLine(fileName);
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine($"Error occurred: {ex.Message}");
            }
        }
    }
}
The above code defines a DataLakeFileSearcher class that uses a DataLakeFileSystemClient to traverse directories and match file paths against a regex pattern. Since direct regex querying isn't supported, this approach filters the paths on the client side after listing them.
Output:
backup/file1.csv
backup/large_file.csv
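If callers supply wildcard patterns such as *.csv rather than regular expressions, one option is to translate them into a regex before filtering. The sketch below is an assumption about how that could be done, not something the SDK provides: GlobPattern and ToRegex are hypothetical names, and only the * and ? wildcards are handled. Note that * here also matches the / separator.

```csharp
using System.Text.RegularExpressions;

public static class GlobPattern
{
    // Hypothetical helper: "*" matches any run of characters (including "/"),
    // "?" matches exactly one character; everything else is taken literally.
    public static string ToRegex(string glob)
    {
        // Regex.Escape turns "*" into "\*" and "?" into "\?", so the escaped
        // forms are swapped for their regex equivalents afterwards.
        string escaped = Regex.Escape(glob);
        return "^" + escaped.Replace(@"\*", ".*").Replace(@"\?", ".") + "$";
    }
}
```

With this helper, GlobPattern.ToRegex("*.csv") produces the same pattern as the sample above, ^.*\.csv$, which can then be passed as filePattern.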