Reputation: 167
We know U-SQL supports directory and filename pattern matching while extracting the files. What I wanted to know does it support pattern matching based on date of creation of the file in ADLS (without implementing custom extractors).
Say a folder contains files created across months (filenames don't have date as part of the filename), is there a way to pull only files of a particular month.
Upvotes: 2
Views: 752
Reputation: 989
it's been a while since this question was asked, and I'm not sure if this is what you were looking for originally, but now you can use the FILE.MODIFIED
U-SQL function:
DECLARE @watermark string = "2018-08-16T18:12:03";
SET @@FeaturePreviews="InputFileGrouping:on";
DECLARE @file_set_path string = "adl://adls.azuredatalakestore.net/stage/InputSample.tsv";
@input =
EXTRACT [columnA] int?,
[columnB] string
FROM @file_set_path
USING Extractors.Tsv(skipFirstNRows : 1, silent : true);
@result =
SELECT *, FILE.MODIFIED(@file_set_path) AS FileModifiedDate
FROM @input
WHERE FILE.MODIFIED(@file_set_path) > DateTime.ParseExact(@watermark, "yyyy-MM-ddTHH:mm:ss", NULL);
OUTPUT @result TO "adl://ADLS.azuredatalakestore.net/stage/OutputSample.tsv" USING Outputters.Tsv(outputHeader:true);
The U-SQL built-in function is documented here: https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/file-modified-u-sql
Upvotes: 0
Reputation: 657
You could probably build a solution using the .NET SDK. For something rather simple you could use PowerShell to create a file which will contain all the files that meet your date time criteria. Then consume the content as desired.
# Log in to your Azure account
Login-AzureRmAccount
# Modify variables as required
$DataLakeStoreAccount = "<yourDataLakeStoreAccountNameHere>";
$DataLakeAnalyticsAccount = <yourDataLakeAnalyticsAccountNameHere>";
$DataLakeStorePath = "/Samples/Data/AmbulanceData/"; #modify as desired
$outputFile = "Samples/Outputs/ReferenceGuide/filteredFiles.csv"; #modify as desired
$filterDate = "2016-11-22";
$jobName = "GetFiles";
# Query directory and build main body of script. Note, there is a csv filter.
[string]$body =
"@initial =
SELECT * FROM
(VALUES
" +
(Get-AzureRmDataLakeStoreChildItem -Account $DataLakeStoreAccount -Path $DataLakeStorePath |
Where {$_.Name -like "*.csv" -and $_.Type -eq "FILE"} | foreach {
"(""" + $DataLakeStorePath + $_.Name + """, (DateTime)FILE.CREATED(""" + $DataLakeStorePath + $_.Name + """)), `r`n" });
# formattig, add column names
$body =
$body.Substring(0,$body.Length-4) + "
) AS T(fileName, createDate);";
# U-SQL query and OUTPUT statement
[string]$output =
"
// filter results based on desired time frame
@filtered =
SELECT fileName
FROM @initial
WHERE createDate.ToString(""yyyy-MM-dd"") == ""$filterDate"";
OUTPUT @filtered
TO ""$outputFile""
USING Outputters.Csv();";
# bring it all together
$script = $body + $output;
#Execute job
$jobInfo = Submit-AzureRmDataLakeAnalyticsJob -Account $DataLakeAnalyticsAccount -Name $jobName -Script $script -DegreeOfParallelism 1
#check job progress
Get-AzureRmDataLakeAnalyticsJob -Account $DataLakeAnalyticsAccount -JobId $jobInfo.JobId -ErrorAction SilentlyContinue;
Write-Host "You now have a list of desired files to check @ " $outputFile
Upvotes: 1
Reputation: 6684
Currently there is no way to access or use file meta data properties. Please add your vote and use case to the following feedback item: https://feedback.azure.com/forums/327234-data-lake/suggestions/10948392-support-functionality-to-handle-file-properties-fr
Upvotes: 0
Reputation: 8619
The U-SQL EXTRACT operator is not aware of any metadata (such as create date) about a file - only the filename.
Upvotes: 1