Reputation: 151
I have one huge CSV file and a Jet cluster with 3 nodes. When the job is submitted, only one node processes the entire file. What I want is for the work to be distributed across the cluster. In other words, how can I make optimal use of the resources on each node so the job finishes faster?
Pipeline p = Pipeline.create();
BatchSource<List<String>> source = Sources.filesBuilder("files")
    .glob("*.csv")
    .build(path -> Files.lines(path).skip(1).map(line -> split(line)));
p.readFrom(source)
    .map(function1)
    .map(function2)
    .writeTo(Sinks.filesBuilder("out").build());
instance.newJob(p).join();
Upvotes: 1
Views: 73
Reputation: 886
Jet 4.2 introduced the rebalance() operator, and I think it will do exactly what you need. By default, data from a non-partitioned source is processed on a single node, but adding rebalance() distributes the work across the cluster.
See https://jet-start.sh/docs/api/more-transforms#rebalance
The rebalance() call would go between readFrom(source) and map(function1).
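To make the placement concrete, here is a minimal sketch of the pipeline from the question with rebalance() added, assuming Jet 4.2 or later on the classpath. The class name RebalancedCsvJob, the comma-based split, the bodies of function1 and function2, and the use of Jet.newJetInstance() are placeholders I made up for illustration, not something taken from the original code.

```java
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.pipeline.BatchSource;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

import java.nio.file.Files;
import java.util.Arrays;
import java.util.List;

public class RebalancedCsvJob {

    // Hypothetical stand-in for function1 from the question: passes the row through.
    static List<String> function1(List<String> row) {
        return row;
    }

    // Hypothetical stand-in for function2 from the question: joins the columns back into one line.
    static String function2(List<String> row) {
        return String.join(",", row);
    }

    static Pipeline buildPipeline() {
        Pipeline p = Pipeline.create();

        // Same source as in the question; the naive comma split is an assumption.
        BatchSource<List<String>> source = Sources.filesBuilder("files")
                .glob("*.csv")
                .build(path -> Files.lines(path)
                        .skip(1) // skip the header row
                        .map(line -> Arrays.asList(line.split(","))));

        p.readFrom(source)
         .rebalance() // spread the items over all cluster members before mapping
         .map(RebalancedCsvJob::function1)
         .map(RebalancedCsvJob::function2)
         .writeTo(Sinks.filesBuilder("out").build());

        return p;
    }

    public static void main(String[] args) {
        // Assumption: start (or join) a member in this JVM; the question already has an instance.
        JetInstance instance = Jet.newJetInstance();
        try {
            instance.newJob(buildPipeline()).join();
        } finally {
            instance.shutdown();
        }
    }
}
```

As far as I understand it, the single file is still read on one member; what rebalance() changes is that the stages after it run on all members. Don't rely on the output being in the same order as the input, and expect each member to write its own files under the "out" directory.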
Upvotes: 1