Reputation: 81
I have an unbounded collection, read from PubsubIO, of records called Trade, with a format like
{
timestamp: 123,
type: "",
side: "", // sell or buy
volume: 123.12,
location: ""
}
There are hundreds of types and more than 40 locations, and their relationship is many-to-many (n <=> n).
My task is to calculate the total volume of trades over 10-minute and 60-minute windows, categorized by side, type, and location, and also to calculate the total volume by type alone. So the output should be four collections (one for each combination of 10-minute/60-minute window and sell/buy side) of something called TotalTrade, like
{
total: 123,
type: "",
location: "",
}
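Outside of Beam, the per-window aggregation described above boils down to summing volume per (type, location) key. Here is a minimal plain-Java sketch of that sum; the Trade and TotalTrade field shapes are taken from the JSON above, while the class and method names are hypothetical:

```java
import java.util.*;
import java.util.stream.*;

public class TradeTotals {
    // Minimal stand-ins for the Trade / TotalTrade records shown above.
    record Trade(long timestamp, String type, String side, double volume, String location) {}
    record TotalTrade(double total, String type, String location) {}

    // Sum volume per (type, location) composite key -- the aggregation
    // the question wants Beam to perform per window and per side.
    static List<TotalTrade> totalsByTypeAndLocation(List<Trade> trades) {
        Map<List<String>, Double> sums = trades.stream()
            .collect(Collectors.groupingBy(
                t -> List.of(t.type(), t.location()),
                Collectors.summingDouble(Trade::volume)));
        return sums.entrySet().stream()
            .map(e -> new TotalTrade(e.getValue(), e.getKey().get(0), e.getKey().get(1)))
            .collect(Collectors.toList());
    }
}
```

In Beam terms this corresponds to keying each element by a composite key and applying a per-key combine, rather than branching into separate collections.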
What I have tried so far: for each collection, I key each Trade by its type, group them into KV<String, Iterable<Trade>>, then run a custom ParDo over each Iterable<Trade>, so the output is KV<String, Iterable<KV<String, TotalTrade>>>.
The problem is in the custom ParDo step: I have to manually group the Trades by location, calculate the totals, and then output the result, which, to me, does not embrace the parallel model of Apache Beam or Google Dataflow.
So my question is: is there any way to branch a collection into a dynamic number of collections in the Beam model? For example, my problem could be solved by the following transforms:

1. Branch the collection by the type of each Trade.
2. Branch each of those collections again by location.
3. Sum the volume in each branch into a TotalTrade. So now we have TotalTrade categorized by location and type.
4. Combine the TotalTrades across locations. So now we have the total volume based on type.
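The final step above (collapsing per-(type, location) totals into per-type totals) is itself just a second keyed sum. A hypothetical plain-Java illustration, using the TotalTrade field shape from the question:

```java
import java.util.*;
import java.util.stream.*;

public class TypeTotals {
    record TotalTrade(double total, String type, String location) {}

    // Re-aggregate per-(type, location) totals into per-type totals,
    // i.e. the final "total volume based on type" output.
    static Map<String, Double> totalsByType(List<TotalTrade> perLocation) {
        return perLocation.stream()
            .collect(Collectors.groupingBy(
                TotalTrade::type,
                Collectors.summingDouble(TotalTrade::total)));
    }
}
```

Because both steps are keyed sums, they map naturally onto two successive keyed combines in Beam instead of dynamic branching.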
Upvotes: 0
Views: 2049
Reputation: 309
It's not possible to branch a collection into a dynamic number of collections if that number is not known when the pipeline is created. The graph of steps is fixed at the beginning of a pipeline and cannot change afterwards.
If the number of categories is dynamic, you can instead try emitting results tagged with an id and grouping by that id. However, you would get hot keys (all values for one id have to be processed by a single worker) if you do not have many ids but do have a lot of values.
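To make the hot-key concern concrete: grouping by id sends every value for a given id to one worker, so the size of the largest group bounds the load on a single worker. A small plain-Java illustration (class and method names are hypothetical):

```java
import java.util.*;
import java.util.stream.*;

public class HotKeys {
    // Group (id, value) pairs by id and report the largest group size.
    // With few ids and many values, one worker would receive this many
    // elements -- the "hot key" concern mentioned in the answer.
    static long largestGroup(List<Map.Entry<String, Double>> keyed) {
        Map<String, Long> counts = keyed.stream()
            .collect(Collectors.groupingBy(Map.Entry::getKey, Collectors.counting()));
        return counts.values().stream().mapToLong(Long::longValue).max().orElse(0);
    }
}
```

With hundreds of types and 40+ locations, the composite-key space here is large enough that hot keys are unlikely to be severe.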
Upvotes: 1