Reputation: 85
Within my Azure Data Lake directory ingest, I have timestamp-named subdirectories:
ingest/1510395023/file.tsv
ingest/1510425023/file.tsv
ingest/1510546210/file.tsv
...
In an Azure Data Factory v2 copy activity (Data Lake to SQL Server), how can I filter the Data Lake directories to only those whose timestamp is greater than a specified value?
Assume I will record the already-copied timestamps into a SQL Server, and on the next pipeline run, I only want to copy incremental new directories based on a Lookup Activity.
Data Factory documentation references logical functions, but I don't understand how to implement them in this scenario.
Alternatively: If a logical comparison filter is available on the file name, as opposed to the directories, that would be helpful as well.
Note: I want to load historical data as well, so filtering only for timestamps greater than or equal to now is insufficient.
Upvotes: 1
Views: 939
Reputation: 81
I assume you have a Data Lake Analytics account.
Using a U-SQL activity, you can consolidate the targeted files into a single file, which you can then move with a regular copy activity.
You can accomplish this by extracting a file set and filtering on its virtual column.
Let @timestamp (a string) be the input parameter, obtained via the Lookup activity, holding the latest timestamp that has already been processed (see the parameter sketch after the script):
@raw =
    EXTRACT [Col1] string,
            [Col2] string,
            [Timestamp] string // virtual column populated from the directory name
    FROM "ingest/{Timestamp}/file.tsv"
    USING Extractors.Tsv(skipFirstNRows : 1 /* change if the header is not part of the file */);

@delta =
    SELECT [Col1], [Col2]
    FROM @raw
    WHERE long.Parse([Timestamp]) > long.Parse(@timestamp);

OUTPUT @delta
TO "delta/file.tsv"
USING Outputters.Tsv(outputHeader : true);
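For the parameter itself, a minimal sketch (an assumption on my part, not something you must do this way) is to give @timestamp a DECLARE EXTERNAL default, so the script also runs stand-alone while a value passed in through the Data Factory U-SQL activity's parameters overrides it:
// Sketch only: default watermark for stand-alone runs; a value supplied via the
// U-SQL activity's parameters overrides a DECLARE EXTERNAL default.
DECLARE EXTERNAL @timestamp string = "0"; // "0" makes the first run pick up all historical directories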
From there you can copy "delta/file.tsv" to your database.
Upvotes: 2