Reputation: 660
I am new to Hazelcast Jet and I am trying to join two data sets. I want to know how the join uses memory, how much it needs, and what the best practice is.
Here is my code:
BatchStage<Map<String,Object>> batch1 = pipeline.readFrom(companyListBatchSource);
BatchStage<Map<String,Object>> batch2 = pipeline.readFrom(employeeListBatchSource);
// Group each input by its join column
BatchStageWithKey<Map<String,Object>, Object> jdbcGroupByKey =
        batch1.groupingKey(a -> a.get(col1));
BatchStageWithKey<Map<String,Object>, Object> fileGroupByKey =
        batch2.groupingKey(b -> b.get(col2));
// Trying to join, but not sure what exactly is happening
BatchStage<Entry<Object, Tuple2<List<Object>, List<Object>>>> d =
        jdbcGroupByKey.aggregate2(toList(), fileGroupByKey, toList());
Will batch1 and batch2 fetch all the data first, store it, and then do the join, or will the records be streamed and joined one by one?
Can anyone explain exactly how it works? I also have to work with big data, and if the data is stored in memory it could cause problems on the server.
Upvotes: 1
Views: 67
Reputation: 10812
In your case it's a so-called global aggregation. Since we don't know the order of the input, we have to accumulate the whole input. And since your aggregation is toList(), we have to store all the items. Therefore, to run this job you'll need to load all items into memory.
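To make the point above concrete, here is a minimal plain-JDK sketch of what a keyed, two-input toList() aggregation has to compute. It is not Jet code and not how Jet implements it internally; the names CoGroupSketch, Lists and coGroup are hypothetical, and the "key:value" string format of the items is an assumption made only for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Plain-JDK illustration only; CoGroupSketch and coGroup are hypothetical names.
class CoGroupSketch {

    // Holds the two result lists for one key, analogous to Tuple2<List, List>.
    static final class Lists<A, B> {
        final List<A> left = new ArrayList<>();
        final List<B> right = new ArrayList<>();
    }

    static <K, A, B> Map<K, Lists<A, B>> coGroup(
            Iterable<A> leftInput, Function<A, K> leftKey,
            Iterable<B> rightInput, Function<B, K> rightKey) {
        Map<K, Lists<A, B>> result = new HashMap<>();
        // Every item is retained. Nothing can be emitted before both inputs end,
        // because an item for any key may still arrive later in either input.
        for (A a : leftInput) {
            result.computeIfAbsent(leftKey.apply(a), k -> new Lists<>()).left.add(a);
        }
        for (B b : rightInput) {
            result.computeIfAbsent(rightKey.apply(b), k -> new Lists<>()).right.add(b);
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed item format for the example: "<key>:<payload>".
        List<String> companies = List.of("1:Acme", "2:Globex");
        List<String> employees = List.of("1:Ann", "1:Bob", "3:Eve");
        Function<String, String> key = s -> s.substring(0, s.indexOf(':'));

        Map<String, Lists<String, String>> joined = coGroup(companies, key, employees, key);
        joined.forEach((k, v) -> System.out.println(k + " -> " + v.left + " / " + v.right));
    }
}
```

The result map ends up referencing every input item from both sides, which is why the memory needed grows with the total input size rather than with the number of keys.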
There's more to it: we optimize the pipeline for throughput, not for memory usage, so we use so-called two-stage aggregation. In the first stage each member groups by the key locally, and then sends the pre-aggregated results over the network to calculate the final aggregation. Since you're aggregating to a list, the job will end up needing memory for double the number of items in both sources.
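The two stages described above can be sketched in plain Java as follows. This is only an illustration of the idea, assuming two cluster members and items formatted as "key:value"; TwoStageSketch, accumulateLocally, combine and countItems are hypothetical names, not Jet APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-JDK illustration only; all names here are hypothetical.
class TwoStageSketch {

    // Stage 1: a member groups only the items it read locally.
    static Map<String, List<String>> accumulateLocally(List<String> localItems) {
        Map<String, List<String>> partial = new HashMap<>();
        for (String item : localItems) {
            // Assumed item format: "<key>:<payload>".
            String key = item.substring(0, item.indexOf(':'));
            partial.computeIfAbsent(key, k -> new ArrayList<>()).add(item);
        }
        return partial;
    }

    // Stage 2: partial lists for the same key are merged into the final list.
    // addAll stands in for receiving the partial result over the network.
    static Map<String, List<String>> combine(List<Map<String, List<String>>> partials) {
        Map<String, List<String>> result = new HashMap<>();
        for (Map<String, List<String>> partial : partials) {
            partial.forEach((key, items) ->
                    result.computeIfAbsent(key, k -> new ArrayList<>()).addAll(items));
        }
        return result;
    }

    // Total number of list entries held by a grouped map.
    static int countItems(Map<String, List<String>> grouped) {
        return grouped.values().stream().mapToInt(List::size).sum();
    }

    public static void main(String[] args) {
        Map<String, List<String>> memberA = accumulateLocally(List.of("1:Ann", "2:Bob"));
        Map<String, List<String>> memberB = accumulateLocally(List.of("1:Cid"));
        Map<String, List<String>> finalResult = combine(List.of(memberA, memberB));

        int heldInStage1 = countItems(memberA) + countItems(memberB);
        int heldInStage2 = countItems(finalResult);
        System.out.println("stage 1 entries: " + heldInStage1);
        System.out.println("stage 2 entries: " + heldInStage2);
    }
}
```

For an aggregation such as counting, the stage-1 result per key is a single number, so pre-aggregation saves network traffic. With toList() the partial result is as large as the input itself, so both stages hold an entry for every item.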
Currently there's no easy way to do this more efficiently, even if, for example, one or both sides are indexed by the grouping key.
Upvotes: 1