Reputation: 3
I'm working on an ETL pipeline with Kiba which imports into multiple, related models in my Rails app. For example, I have records which have many images. There might also be collections which contain many records.
My data will come from various sources, including HTTP APIs and CSV files. I would like to make the pipeline as modular and reusable as possible, so that for each new type of source I only have to create the source, and the rest of the pipeline definition stays the same.
Given multiple models in the destination, and possibly several API calls to get the data from the source, what's the standard pattern for this in Kiba?
I could create one pipeline where the destination is 'the application' and has responsibility for all these models, but this feels like the wrong approach because the destination would be responsible for saving data across different Rails models, uploading images, etc.
Should I create one master pipeline which triggers more specific ones, passing in a specific type of data (e.g. image URLs for import)? Or is there a better approach than this?
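For reference, this is roughly the shape of reuse I have in mind, where only the source changes per import. All class names here are made up for illustration, and the destination assumes a Rails Record model:

```ruby
require 'kiba'
require 'csv'

# Made-up source: any class with #each that yields row hashes works as a Kiba source.
class CsvRecordsSource
  def initialize(file)
    @file = file
  end

  def each
    CSV.foreach(@file, headers: true) { |row| yield row.to_h }
  end
end

# Made-up destination: saves into the Record model. Images, collections etc.
# would need their own handling here or in separate destinations.
class RecordDestination
  def write(row)
    Record.create!(title: row['title'], description: row['description'])
  end
end

# Only the source varies per import; transforms and destination stay the same.
def build_import_job(source_class, source_arg)
  Kiba.parse do
    source source_class, source_arg
    transform { |row| row } # shared cleanup/normalisation would go here
    destination RecordDestination
  end
end

Kiba.run(build_import_job(CsvRecordsSource, 'records.csv'))
```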
Thanks.
Upvotes: 0
Views: 92
Reputation: 8873
Kiba author here!
It is natural & common to look for some form of genericity, modularity and reusability in data pipelines. I would say, though, that as with regular code, it can be hard at first to figure out the right way to achieve that (it will depend quite a bit on your exact situation).
This is why my recommendation would instead be to:

- start with a concrete, working pipeline for your first source, without trying to make it generic up front
- cover it with tests (use webmock or similar to stub out API requests & make tests completely isolated, create tests with 1 row from source to destination) - this will make it easy to refactor stuff later (a rough test sketch is included at the end of this answer)

Depending on your exact situation, maybe you will extract specific components, or maybe you will end up extracting a whole generic job, or generic families of jobs etc.
This approach works well even as you get more experience working with Kiba (this is how I gradually extracted the components that you will find in kiba-common and kiba-pro, too).
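To illustrate the testing point above, here is a minimal sketch of that kind of isolated one-row test, assuming Minitest and two invented classes, ApiRecordsSource and ArrayDestination; the important part is that webmock stubs the HTTP call so the test never touches the network:

```ruby
require 'minitest/autorun'
require 'webmock/minitest'
require 'kiba'
require 'json'
require 'net/http'

# Hypothetical source pulling rows from an HTTP API.
class ApiRecordsSource
  def initialize(url)
    @url = url
  end

  def each
    JSON.parse(Net::HTTP.get(URI(@url))).each { |row| yield row }
  end
end

# Hypothetical in-memory destination, convenient for asserting on output rows.
class ArrayDestination
  def initialize(rows)
    @rows = rows
  end

  def write(row)
    @rows << row
  end
end

class ApiImportJobTest < Minitest::Test
  def test_one_row_from_source_to_destination
    # Stub the API request so the test is completely isolated.
    stub_request(:get, "https://api.example.com/records")
      .to_return(body: [{ "title" => "A record" }].to_json)

    output = []
    job = Kiba.parse do
      source ApiRecordsSource, "https://api.example.com/records"
      transform { |row| row.merge("imported" => true) }
      destination ArrayDestination, output
    end
    Kiba.run(job)

    assert_equal [{ "title" => "A record", "imported" => true }], output
  end
end
```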
Upvotes: 1