a.ndrea

Reputation: 536

Suggested Architecture for a batch with multi-threading and common resources

I need to write a batch job in Java that uses multiple threads to perform various operations on a set of data.
I have almost 60k rows of data and need to run different operations on them. Some of the operations work on the same input data but write to different outputs.
So the question is: is it right to create one big 60k-element ArrayList and pass it through the various operators so that each one can add its own output, or is there a better architecture that someone can suggest?

EDIT:
I need to create these objects:

MyObject, with an ArrayList of MyObject2, 3 different Integers, and 2 Strings.
MyObject2, with 12 floats.
MyBigObject, with an ArrayList of MyObject (usually about 60k elements) and some Strings.

My different operators work on the same ArrayList of MyObject2 but write to the Integers. For example, Operators1 reads from the ArrayList of MyObject2, performs some calculation, and writes its result to MyObject.Integer1; Operators2 reads from the same ArrayList of MyObject2, performs a different calculation, and writes its result to MyObject.Integer2; and so on.

Is this architecture "safe"? The ArrayList of MyObject2 has to be read-only and must never be modified by any operator.

EDIT: I don't have any code yet because I'm studying the architecture first; I'll start writing something afterwards.
Trying to rephrase my question:

Is it OK, in a batch written in pure Java (without any framework; I'm not using Spring Batch, for example, because it would be like shooting a fly with a shotgun for my project), to create a macro object and pass it around so that every thread can read from the same data but write its results to different data? Can it be dangerous if different threads read from the same data at the same time?

Upvotes: 1

Views: 401

Answers (1)

It depends on your operations.

Generally it's possible to partition work on a dataset horizontally or vertically.

Horizontally means splitting your dataset into several smaller sets and letting each individual thread handle one such set. This approach is the safest, yet it is usually slower because each individual thread performs several different operations. It's also a bit more complex to reason about for the same reason.
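To make the horizontal variant concrete, here is a minimal sketch. The `Row` type, its `sum` and `max` fields, and the two calculations are made up for illustration (they stand in for your MyObject and its Integers); treat it as one possible shape under those assumptions, not the way to do it. Each task gets its own slice of the list and runs every operation on the rows in that slice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class HorizontalSketch {

    // Hypothetical row type: read-only input plus one output slot per operation.
    static final class Row {
        final float[] values; // input, never modified by the workers
        int sum;              // output of (made-up) operation 1
        int max;              // output of (made-up) operation 2
        Row(float[] values) { this.values = values; }
    }

    // Splits rows into contiguous slices; each task runs ALL operations on its slice.
    static void process(List<Row> rows, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = (rows.size() + threads - 1) / threads; // ceiling division
            List<CompletableFuture<Void>> tasks = new ArrayList<>();
            for (int start = 0; start < rows.size(); start += chunk) {
                List<Row> slice = rows.subList(start, Math.min(start + chunk, rows.size()));
                tasks.add(CompletableFuture.runAsync(() -> {
                    for (Row r : slice) {
                        float s = 0f;
                        float m = Float.NEGATIVE_INFINITY;
                        for (float v : r.values) {
                            s += v;
                            m = Math.max(m, v);
                        }
                        r.sum = Math.round(s);
                        r.max = Math.round(m);
                    }
                }, pool));
            }
            // Wait for every slice; after join() the results are visible to this thread.
            for (CompletableFuture<Void> t : tasks) t.join();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        for (int i = 0; i < 60_000; i++) rows.add(new Row(new float[] {i, 1f, 2f}));
        process(rows, 4);
        System.out.println(rows.get(10).sum + " " + rows.get(10).max);
    }
}
```

No two tasks ever touch the same row, so no locking is needed as long as the list itself is not resized while the tasks run.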

Vertically means each thread performs some operation on a specific "field" or "column" or whatever the individual data unit is in your dataset. This is generally easier to implement (each thread does one thing on the whole set) and can be faster. However, each operation on the dataset needs to be independent of the other operations. If you are unsure about multi-threading in general, I recommend doing the work horizontally in parallel.
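A vertical split could look roughly like the sketch below. Again, `Row`, `sum`, `max` and the two calculations are hypothetical stand-ins for your MyObject, Integer1, Integer2 and operators. Each task walks the whole list, reads the shared input, and writes only to its own field, which is what keeps the operations independent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class VerticalSketch {

    // Hypothetical row type: shared read-only input, one output field per operator.
    static final class Row {
        final float[] values; // read by both operators, written by none
        int sum;              // written only by the "sum" operator
        int max;              // written only by the "max" operator
        Row(float[] values) { this.values = values; }
    }

    // One task per operation; each task covers the WHOLE dataset.
    static void process(List<Row> rows) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            CompletableFuture<Void> sums = CompletableFuture.runAsync(() -> {
                for (Row r : rows) {
                    float s = 0f;
                    for (float v : r.values) s += v;
                    r.sum = Math.round(s);
                }
            }, pool);
            CompletableFuture<Void> maxes = CompletableFuture.runAsync(() -> {
                for (Row r : rows) {
                    float m = Float.NEGATIVE_INFINITY;
                    for (float v : r.values) m = Math.max(m, v);
                    r.max = Math.round(m);
                }
            }, pool);
            // Wait for both operators before reading any result.
            sums.join();
            maxes.join();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        for (int i = 0; i < 60_000; i++) rows.add(new Row(new float[] {i, 1f, 2f}));
        process(rows);
        System.out.println(rows.get(10).sum + " " + rows.get(10).max);
    }
}
```

This only stays safe while no operator reads a field that another operator writes. As soon as one calculation depends on another's output, the two have to be ordered or merged into one task.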

Now to the question of whether it is OK to pass your full dataset around (some ArrayList): sure it is! It's just a reference, so passing it around costs next to nothing. What matters are the operations you perform on the dataset.
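If you want the "read only" rule enforced rather than just agreed upon, one option is to hand the operators an unmodifiable view of the list. The sketch below is only an illustration: the class and method names are invented, and `float[]` stands in for your MyObject2. Note that the view is still the same underlying data (no copy is made), and it only blocks changes to the list itself, not to the elements inside it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class SharedReferenceSketch {

    // Wraps the list in a read-only view; the rows are NOT copied.
    static List<float[]> readOnlyView(List<float[]> rows) {
        return Collections.unmodifiableList(rows);
    }

    // Returns true if the given list refuses structural changes.
    static boolean rejectsWrites(List<float[]> view) {
        try {
            view.add(new float[12]);
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<float[]> data = new ArrayList<>();
        data.add(new float[12]);

        List<float[]> view = readOnlyView(data);
        // Same element objects are reachable through both references.
        System.out.println(view.get(0) == data.get(0));
        // The view itself cannot be grown, shrunk, or reordered.
        System.out.println(rejectsWrites(view));
        // Caveat: view.get(0)[0] = 1f would still compile and run,
        // because the elements themselves stay mutable.
    }
}
```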

Upvotes: 2
