Reputation: 1338
Kiba is a very small library, and it is my understanding that most of its value is derived from enforcing a modular architecture of small independent transformations.
However, it seems to me that the model of a series of serial transformations does not fit most of the ETL problems we face. To explain the issue, let me give a contrived example:
A source yields hashes with the following structure
{ spend: 3, cost: 7, people: 8, hours: 2 ... }
Our preferred output is a list of hashes where some of the keys might be the same as those from the source, though the values might differ
{ spend: 8, cost: 10, amount: 2 }
Now, calculating the resulting spend requires a series of transformations: ConvertCurrency, MultiplyByPeople, and so on. So does calculating the cost: ConvertCurrencyDifferently, MultiplyByOriginalSpend, etc. Notice that the cost calculations depend on the original (non-transformed) spend value.
The most natural pattern would be to calculate the spend and cost in two independent pipelines, and merge the final output. A map-reduce pattern if you will. We could even benefit from running the pipelines in parallel.
In my case, however, it is not really a question of performance (the transformations are very fast). The issue is that since Kiba applies all transforms as a series of serial steps, the cost calculations will be affected by the spend calculations, and we will end up with the wrong result.
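To make that concrete, here is a minimal sketch of the failure mode (hypothetical transforms, with a made-up EXCHANGE_RATE constant):

transform do |row|
  row.merge(spend: row[:spend] * EXCHANGE_RATE) # ConvertCurrency overwrites :spend
end

transform do |row|
  # MultiplyByOriginalSpend now reads the converted :spend, not the original one
  row.merge(cost: row[:cost] * row[:spend])
end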
Does Kiba have a way of solving this issue? The only solution I can think of is to make sure the destination keys differ from the source keys, e.g. something like 'originSpend' and 'finalSpend'. It still bothers me, however, that my spend pipeline would then have to pass the full set of keys along at each step, rather than handling only the keys relevant to it and merging in the cost keys at the end. Or perhaps one could define two independent Kiba jobs and have a master job run both, merging their results at the end? What is the most Kiba-idiomatic solution to this?
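For illustration, the key-preserving workaround would look something like this (again with hypothetical transforms):

transform do |row|
  row.merge(original_spend: row[:spend]) # snapshot the raw value first
end

transform do |row|
  row.merge(spend: row[:spend] * EXCHANGE_RATE) # spend pipeline step
end

transform do |row|
  row.merge(cost: row[:cost] * row[:original_spend]) # cost reads the snapshot
end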
Splitting an ETL pipeline into multiple parallel paths seems to be a key feature of most ETL tools, so I'm surprised that it doesn't seem to be something Kiba supports?
Upvotes: 0
Views: 86
Reputation: 8873
I think I lack some details needed to properly answer your main question. I will get in touch via email for this round, and will maybe comment here later for public visibility.
Splitting an ETL pipeline into multiple parallel paths seems to be a key feature of most ETL tools, so I'm surprised that it doesn't seem to be something Kiba supports?
The main focus of Kiba ETL today is component reuse, lower maintenance cost, modularity, and the ability to achieve strong data & process quality.
Parallelisation is supported to some extent, though, via different patterns.
If your main input is something you can manage to "partition" into a low volume of work items (e.g. database id ranges, or a list of files), you can use the Kiba Pro parallel transform like this:
source ... # something that generates a list of work items

parallel_transform(max_threads: 10) do |group_items|
  Kiba.run(...)
end
This works well if there is no output at all, or not much output, coming to the destinations of the sister jobs.
This works with threads, but one can also "fork" here for extra performance.
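As a rough illustration of the fork variant (plain Ruby, not a Kiba Pro API; partitions and build_job_for are hypothetical):

partitions.each do |partition|
  fork do
    Kiba.run(build_job_for(partition)) # one child process per partition
  end
end
Process.waitall # wait for all children to finish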
In a similar fashion, one can structure jobs so that each process only handles a subset of the input data.
This way one can start, say, 4 processes (via cron jobs, or monitored via a parent tool) and pass a SHARD_NUMBER=1,2,3,4, which is then used by the source for input-load partitioning.
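A minimal sketch of such a source (illustrative only; the id-modulo partitioning and the all_row_ids/fetch_row helpers are assumptions):

class ShardedSource
  def initialize(shard_number:, shard_count:)
    @shard_number = shard_number
    @shard_count = shard_count
  end

  # Kiba sources implement each, yielding one row at a time
  def each
    all_row_ids.each do |id|
      next unless id % @shard_count == @shard_number - 1 # keep this shard's ids
      yield fetch_row(id)
    end
  end
end

Each process would then declare something like:

source ShardedSource, shard_number: Integer(ENV.fetch("SHARD_NUMBER")), shard_count: 4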
I'm pretty sure your problem, as you said, is more about workflow control & declarations & the ability to express what you need done, rather than performance.
I'll reach out and we'll discuss that.
Upvotes: 1