Reputation: 75316
I am trying to grasp the concept of Data Factory and understand how a scheduled activity works, but I don't really understand much yet.
Assume I have a workflow as below:
I have an agent (built as a Windows Service) running on the client's machine which is scheduled to extract data from an SAP source daily at 1 AM and then put it on Azure Blob storage. The agent extracts only yesterday's data. Example: the agent running at 1 AM today (9 April) extracts only the whole of 8 April's data. This agent is not related to Data Factory.
Assume it takes around 30 minutes for the agent to get the daily data (8 April) and put it in blob storage; it may be more or less depending on how big the data is.
I have a Data Factory pipeline (active forever from 2016-04-08T01:30:00Z) which uses blob storage as the input dataset and one scheduled activity to copy data from blob storage to a database.
The input dataset's availability is set to a daily frequency:
"availability": {
"frequency": "Day",
"interval": 1
}
The activity's scheduler is set to a daily frequency:
"scheduler": {
"frequency": "Day",
"interval": 1
}
So, based on the workflow, my questions are:
After 1:30 AM, the agent finishes the data extraction from SAP and puts it into blob storage as the input dataset. How does Data Factory know that the data slice for 8 April is ready for processing?
What if the data is not ready by 1:30 AM because the extraction is still running at that time?
Upvotes: 1
Views: 3947
Reputation: 1738
If I understand your particular scenario correctly, and you have access to modify the code of the Windows Service, you can have the Windows Service kick off the ADF pipeline when it completes. I am doing something exactly like this because I need to control when my pipeline begins. I have a local job pulling data from a few data sources and putting it into an Azure SQL DB. Once that is complete, I need my pipeline to start, but there was no way for me to know exactly when my job would complete. So the final step of my local job is to kick off my ADF pipeline. I have a write-up on how to do it here - Starting an azure data factory pipeline from .net.
Hope this helps.
Upvotes: 1
Reputation: 490
If you have data in Azure Blob Storage that appears daily, you can try using date folders (e.g. .../yyyy/MM/dd/...). Data Factory can detect whether a particular date folder exists to determine whether the slice for that day is ready for processing. If Data Factory doesn't see the folder for that day, it will not execute the pipeline for that slice.
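As a rough sketch of this approach (the container path "sapdata" and the linked service name "AzureStorageLinkedService" are made-up placeholders), the input dataset would be marked as external and partitioned by date, with the externalData policy controlling how often Data Factory re-checks for the slice's folder:
{
    "name": "InputBlobDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService",
        "typeProperties": {
            "folderPath": "sapdata/{Year}/{Month}/{Day}/",
            "format": { "type": "TextFormat" },
            "partitionedBy": [
                { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
                { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
                { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }
            ]
        },
        "external": true,
        "availability": {
            "frequency": "Day",
            "interval": 1
        },
        "policy": {
            "externalData": {
                "retryInterval": "00:01:00",
                "retryTimeout": "00:10:00",
                "maximumRetry": 3
            }
        }
    }
}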
I would also suggest including the extraction process as a part of the Data Factory processing so that if the extraction fails, the pipeline will not be executed further.
I hope this helps!
Upvotes: 1
Reputation: 164
To my knowledge, Azure Data Factory does not currently support triggering a pipeline by creating or updating a blob.
For this workflow, the solution is to schedule the input dataset based on time. If you're confident that the data extraction will be complete by 1:30 AM, then you can schedule the job to run daily at 1:30 AM (or perhaps a little later, in case the extraction runs long). To do this, set your pipeline's start time to something like "2016-04-08T01:30:00Z" (UTC). You should be able to author the input dataset in such a way that the job fails if the data extraction is not yet complete, which would let you notice the failure and rerun it. The activity will start when you schedule it to and will complete as soon as possible. See this page for details on moving data between Azure Blob and Azure SQL. Your workflow would look very similar to the example at that link, except that your frequency would be "Day".
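As a minimal sketch (the dataset and activity names here are placeholders, not from your setup), the pipeline definition would carry the daily scheduler on the copy activity and the 1:30 AM UTC start time at the pipeline level:
{
    "name": "CopyBlobToSqlPipeline",
    "properties": {
        "activities": [
            {
                "name": "BlobToSqlCopy",
                "type": "Copy",
                "inputs": [ { "name": "InputBlobDataset" } ],
                "outputs": [ { "name": "OutputSqlDataset" } ],
                "typeProperties": {
                    "source": { "type": "BlobSource" },
                    "sink": { "type": "SqlSink" }
                },
                "scheduler": {
                    "frequency": "Day",
                    "interval": 1
                },
                "policy": {
                    "retry": 3,
                    "timeout": "01:00:00"
                }
            }
        ],
        "start": "2016-04-08T01:30:00Z",
        "end": "2099-12-31T00:00:00Z"
    }
}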
Depending on how the local data is stored, it may be worth looking into moving the data directly from your on-prem source, bypassing Azure Blob. This is supported using a Data Management Gateway, as documented here. Unfortunately, I'm not familiar with SAP, so I can't offer more information about that.
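To illustrate the gateway approach only (this example assumes an on-premises SQL Server source rather than SAP, and all names are made up), the linked service simply references the Data Management Gateway by name:
{
    "name": "OnPremSqlLinkedService",
    "properties": {
        "type": "OnPremisesSqlServer",
        "typeProperties": {
            "connectionString": "Data Source=<server>;Initial Catalog=<database>;Integrated Security=False;User ID=<user>;Password=<password>;",
            "gatewayName": "MyDataManagementGateway"
        }
    }
}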
Upvotes: 0