Reputation: 582
I'm developing a pipeline that reads data from Kafka.
The source Kafka topic carries a lot of traffic: about 10k messages are inserted per second, and each message is around 200 kB.
I need to filter the data before applying my transformations, but I'm not sure whether the order in which I apply the filter and the window function matters.
Would
read->window->filter->transform->write
be more efficient than
read->filter->window->transform->write
or would both options perform the same?
I know that Beam is just a model that describes the what and not the how, and that the runner optimizes the pipeline, but I want to be sure I've understood this correctly.
Thanks
Upvotes: 0
Views: 47
Reputation: 5104
If there is substantial filtering, windowing after the filter will technically reduce the amount of work performed, because fewer elements need a window assigned. That said, the saved work is cheap enough that I doubt it would make a measurable difference. (Presumably the runner could notice that the filter does not observe the assigned window and move the filter ahead of the windowing in that case, but as mentioned, it's unclear whether there are really savings to be gained here.)
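To make the point concrete, here is a rough plain-Python sketch (not Beam code, and not how any runner actually executes a pipeline). It assumes fixed 60-second windows and a filter that only looks at the element value, never at the window. The names `assign_window`, `window_then_filter`, `filter_then_window`, and `WINDOW_SIZE` are made up for illustration. It shows that both orderings produce the same windowed output, and that the only difference is how many elements go through window assignment.

```python
from collections import defaultdict

WINDOW_SIZE = 60  # seconds; hypothetical fixed-window size


def assign_window(ts, size=WINDOW_SIZE):
    """Map a timestamp to its fixed window [start, end)."""
    start = ts - ts % size
    return (start, start + size)


def window_then_filter(events, keep):
    """Assign a window to every element, then drop the unwanted ones."""
    assigned = 0
    out = defaultdict(list)
    for ts, value in events:
        window = assign_window(ts)
        assigned += 1
        if keep(value):
            out[window].append(value)
    return dict(out), assigned


def filter_then_window(events, keep):
    """Drop unwanted elements first, then window only the survivors."""
    assigned = 0
    out = defaultdict(list)
    for ts, value in events:
        if not keep(value):
            continue
        window = assign_window(ts)
        assigned += 1
        out[window].append(value)
    return dict(out), assigned


# 180 synthetic events, one per second; the filter keeps every 10th value.
events = [(t, t) for t in range(180)]
keep = lambda v: v % 10 == 0

a, n_a = window_then_filter(events, keep)
b, n_b = filter_then_window(events, keep)

print(a == b)      # same windowed output either way
print(n_a, n_b)    # window assignments performed by each ordering
```

With this toy input the two results are identical, and the filter-first version performs a tenth of the window assignments. Since assigning a fixed window is just a bit of arithmetic per element, that difference is unlikely to matter next to reading and deserializing 200 kB messages.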
Upvotes: 0