Split a single PCollection into a dynamic number of PCollections, then run a calculation on each collection

Question

I have an unbounded collection that is read from PubsubIO. Each element, called a Trade, has a format like

{
  timestamp: 123,
  type: "",
  side: "", // sell or buy
  volume: 123.12,
  location: ""
}

There are hundreds of types and more than 40 locations, and the relation between them is many-to-many (n <=> n).

My task is to calculate the total volume of trades over 10-minute and 60-minute windows, categorized by side, type and location, and also to calculate the total volume per type. So the output should be 4 collections, one for each combination of window size (10 mins, 60 mins) and side (sell, buy), whose elements, called TotalTrade, look like

{
  total: 123,
  type: "",
  location: "",
}
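
To make the task concrete, here is a minimal plain-Python sketch (not Beam code) of what "total volume per fixed window, categorized by side, type and location" could mean. The function name `window_totals`, the assumption that `timestamp` is in seconds, and the sample trades are all hypothetical; the question does not specify them.

```python
from collections import defaultdict

def window_totals(trades, window_secs):
    """Sum volume per (window start, side, type, location).

    Assumes `timestamp` is in seconds; the question does not state the unit.
    """
    totals = defaultdict(float)
    for t in trades:
        # Fixed windows: each timestamp falls into exactly one bucket.
        window_start = t["timestamp"] - t["timestamp"] % window_secs
        key = (window_start, t["side"], t["type"], t["location"])
        totals[key] += t["volume"]
    return dict(totals)

# Hypothetical sample data in the Trade format shown earlier.
trades = [
    {"timestamp": 10, "type": "A", "side": "buy", "volume": 1.5, "location": "X"},
    {"timestamp": 70, "type": "A", "side": "buy", "volume": 2.0, "location": "X"},
    {"timestamp": 700, "type": "A", "side": "sell", "volume": 4.0, "location": "Y"},
]

ten_min = window_totals(trades, 600)     # 10-minute windows
sixty_min = window_totals(trades, 3600)  # 60-minute windows
```

Splitting either result by the `side` component of the key would give the four collections described above.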

What I have tried so far:

Branch the collection into 2 collections based on the side of the trade.

For each collection, I:

Window the collection into fixed windows of 10 mins
ParDo each Trade into a KV keyed by the type of the Trade
GroupByKey, so we have a collection of KV<String, Iterable<Trade>>
Apply a custom ParDo that calculates the total volume for each location in the Iterable, so the output is one KV per type holding the per-location totals
...

The problem is in the custom ParDo step. I have to manually group the Trades by location, calculate the total, then output the result. To me, this does not embrace the parallel model of Apache Beam or Google Dataflow.
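
In plain Python terms (not the Beam API), the manual step described here amounts to something like the sketch below: for one type, a single call walks the whole grouped Iterable and does the per-location bookkeeping itself. The name `totals_by_location` and the sample values are hypothetical.

```python
from collections import defaultdict

def totals_by_location(trades_of_one_type):
    """Mimics the custom ParDo body: one call handles a whole per-type group."""
    totals = defaultdict(float)
    for trade in trades_of_one_type:  # sequential pass over the Iterable
        totals[trade["location"]] += trade["volume"]
    return dict(totals)

# Hypothetical output of the GroupByKey step: type -> trades of that type.
grouped = {
    "A": [
        {"location": "X", "volume": 1.0},
        {"location": "Y", "volume": 2.0},
        {"location": "X", "volume": 3.0},
    ],
}

per_type = {ttype: totals_by_location(ts) for ttype, ts in grouped.items()}
```

The grouping by location happens inside the function rather than as a keyed step of the pipeline, which is the part that feels hand-rolled.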

So my question is: is there any way to branch a collection into a dynamic number of collections in the Beam model? For example, my problem could be solved by the following transforms.

Transform the collection into collections based on the type of Trade
Transform each of these collections into collections based on location
Do a Combine transform to calculate TotalTrade

So now we have TotalTrade categorized by location and type.

Do a Flatten transform on each set of collections from step 4.
Do a Combine on each collection

So now we have the total volume based on type.
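
The result these steps are after can be sketched in plain Python (not Beam code) as two successive sums: first per (type, location), then those partial totals rolled up per type. The helper `sum_by_key` and the sample values are hypothetical and only illustrate the intended output.

```python
from collections import defaultdict

def sum_by_key(pairs):
    """Sum the values of (key, value) pairs that share a key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Hypothetical trades from a single window and side.
trades = [
    {"type": "A", "location": "X", "volume": 1.0},
    {"type": "A", "location": "Y", "volume": 2.0},
    {"type": "B", "location": "X", "volume": 4.0},
    {"type": "A", "location": "X", "volume": 0.5},
]

# First level: total per (type, location).
by_type_and_location = sum_by_key(
    ((t["type"], t["location"]), t["volume"]) for t in trades
)

# Second level: roll the per-location totals up to one total per type.
by_type = sum_by_key(
    (ttype, total) for (ttype, _location), total in by_type_and_location.items()
)
```

Note that the number of distinct keys is discovered from the data at run time, so the sketch never needs to know in advance how many types or locations exist.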
