Reputation: 91
I am new to Google Cloud and to Dataflow. What I want to do is to create a tool which detects and corrects errors in (large) CSV files. However, not every error that should be dealt with is known at design time. Therefore, I need an easy way to add new functions that handle a specific error.
The tool should be something like a framework that automatically creates a Dataflow template based on a user selection. I have already thought of a workflow that could work, but as mentioned before I am totally new to this, so please feel free to suggest a better solution:
To achieve this extensibility, new functions which implement the error corrections should be easy to add. What I thought of was:
My problem, however, is that I am not sure whether this is possible or a "good" solution to this problem. Furthermore, I don't know how to write a Python script that uses functions which are not known at design time. I thought of using something like a strategy pattern, but as far as I know the strategies still have to be implemented at design time, even though the decision which one to use is made at run time. A rough sketch of the kind of plug-in registration I have in mind is below. Any help would be greatly appreciated!
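For illustration, a minimal sketch of a registry-based approach (all names here are made up); each correction registers itself under a name, and the pipeline only iterates over whatever is registered and selected at run time:

```python
# Registry of available error corrections, populated at import time.
CORRECTIONS = {}

def correction(name):
    """Decorator that registers an error-correction function under a name."""
    def register(func):
        CORRECTIONS[name] = func
        return func
    return register

@correction("strip_whitespace")
def strip_whitespace(row):
    # Remove leading/trailing whitespace from every field of a CSV row
    return [field.strip() for field in row]

@correction("fill_missing")
def fill_missing(row, placeholder="N/A"):
    # Replace empty fields with a placeholder value
    return [field if field else placeholder for field in row]

def apply_corrections(row, selected):
    """Apply the user-selected corrections (by name) to a single CSV row."""
    for name in selected:
        row = CORRECTIONS[name](row)
    return row

# The selection would come from the user / template parameters at run time
print(apply_corrections(["  foo ", ""], ["strip_whitespace", "fill_missing"]))
```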
Upvotes: 1
Views: 611
Reputation: 3883
What you can use in your architecture is Cloud Functions together with Cloud Composer (a hosted solution for Apache Airflow). Apache Airflow is designed to run DAGs on a regular schedule, but you can also trigger DAGs in response to events, such as a change in a Cloud Storage bucket (where the CSV files can be stored). You can wire this up with your frontend, and every time a new file arrives in the bucket, the DAG containing the step-by-step process is triggered.
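As a rough sketch, a background Cloud Function bound to the bucket's finalize event could look like this (the DAG id and the trigger_dag helper are placeholders; in practice you make an authenticated request to the Airflow REST API of your Composer environment, as described in the Composer documentation on triggering DAGs with Cloud Functions):

```python
# Sketch of a background Cloud Function triggered by a GCS "finalize" event.
def handle_new_csv(data, context):
    """Entry point; `data` describes the object that was uploaded."""
    bucket = data["bucket"]
    name = data["name"]

    # Only react to CSV uploads
    if not name.endswith(".csv"):
        return

    # Hand the file location to the DAG via its run configuration
    trigger_dag("csv_error_correction",
                conf={"input_file": "gs://{}/{}".format(bucket, name)})


def trigger_dag(dag_id, conf):
    """Placeholder: POST to the Airflow REST API of your Composer environment
    (see the Composer docs on triggering DAGs with Cloud Functions)."""
    raise NotImplementedError
```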
Please have a look at the official documentation, which describes the process of launching Dataflow pipelines with Cloud Composer using the DataflowTemplateOperator.
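As a rough illustration, a DAG that launches a staged Dataflow template could look like this (the project, buckets, template path and parameter names are assumptions, and the import path depends on your Airflow version):

```python
from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator
from airflow.utils import dates

with DAG(
    "csv_error_correction",
    schedule_interval=None,        # run only when triggered, e.g. by a Cloud Function
    start_date=dates.days_ago(1),
) as dag:

    run_template = DataflowTemplateOperator(
        task_id="run_csv_correction_template",
        # GCS path of the staged Dataflow template (assumed name)
        template="gs://my-bucket/templates/csv_correction",
        dataflow_default_options={
            "project": "my-gcp-project",            # assumed project id
            "region": "europe-west1",               # assumed region
            "tempLocation": "gs://my-bucket/tmp/",  # assumed temp location
        },
        parameters={
            # Runtime parameters of the template; dag_run.conf can carry the
            # file that triggered the run (see the Cloud Function sketch above)
            "inputFile": "{{ dag_run.conf.get('input_file', '') }}",
            "outputPath": "gs://my-bucket/corrected/",
        },
    )
```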
Upvotes: 1