Reputation: 1124
In my Spark application I receive comma-separated strings, which I then split and analyze as arrays. The indexes are well determined at the beginning of my main, all of them static final int. The code readability is good enough, although not exactly crystal clear, and after some time it becomes troublesome to keep track of everything. A pseudocode example:
rdd.map(receivedString -> {
    String[] data = receivedString.split(",");
    someOperation = operation(data[CONSTANT_INDEX]);
    someOtherOperation = otherOperation(data[INDEX], data[INDEX2]);
    data[RESULT_INDEX] = thirdOperation(data[THIRD_INDEX]);
});
In another part of the code, I tried using a dedicated class to hold my data and to operate on it: much easier to track my operations. For example:
rdd.map(receivedString -> {
    MyClass myData = new MyClass(receivedString.split(","));
    someOperation = operation(myData.getElement1());
    increaseOperation = myData.increaseValue();
    myData.setOtherValue(thirdOperation(myData.getOtherValue()));
});
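For context, a minimal sketch of the kind of wrapper class I have in mind (field names and positions are just examples, and it implements Serializable so Spark can ship it between executors):

```java
import java.io.Serializable;

// Hypothetical wrapper for one comma-separated record.
// Field order in the constructor is illustrative only.
class MyClass implements Serializable {
    private final String element1;
    private double value;
    private double otherValue;

    MyClass(String[] fields) {
        this.element1 = fields[0];
        this.value = Double.parseDouble(fields[1]);
        this.otherValue = Double.parseDouble(fields[2]);
    }

    String getElement1() { return element1; }

    // Increment the running value and return the new total.
    double increaseValue() { return ++value; }

    double getOtherValue() { return otherValue; }
    void setOtherValue(double v) { this.otherValue = v; }
}
```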
If I receive hundreds of lines per second, 24/7, which approach should I use? Will the per-record object creation overhead stress my environment too much?
I hope this doesn't step into "primarily opinion-based" territory: which approach do you personally use, and how do you choose it?
Upvotes: 0
Views: 34
Reputation: 37435
"Premature optimization is the root of all evil." - Donald Knuth
Code readability, testability and maintainability are primary goals in software engineering. Class creation will indeed incur some overhead, but it is probably negligible compared to the I/O involved in a distributed process.
So, use the approach that improves code quality, and only "go down to the metal" if you find a performance issue. And by "find", I mean using profiling techniques to determine where the performance issue actually is.
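If you want a rough sanity check before reaching for a real profiler, a micro-measurement sketch like the one below (class and field names are made up, and this is no substitute for a proper profiler such as JFR or async-profiler) lets you compare the raw-index style against the wrapper style on your own hardware:

```java
// Rough comparison: raw array indexing vs. a small wrapper object,
// parsing the same comma-separated line many times. Timings will
// vary by JVM and machine, so no expected output is claimed here.
public class AllocationCheck {
    // Hypothetical minimal wrapper, mirroring the MyClass idea.
    static class Row {
        final double value;
        Row(String[] fields) { value = Double.parseDouble(fields[1]); }
    }

    public static void main(String[] args) {
        String line = "id,42.0,7.5";
        int n = 1_000_000;

        long t0 = System.nanoTime();
        double sumRaw = 0;
        for (int i = 0; i < n; i++) {
            String[] f = line.split(",");
            sumRaw += Double.parseDouble(f[1]);      // raw-index style
        }
        long rawNs = System.nanoTime() - t0;

        long t1 = System.nanoTime();
        double sumWrapped = 0;
        for (int i = 0; i < n; i++) {
            sumWrapped += new Row(line.split(",")).value;  // wrapper style
        }
        long wrapNs = System.nanoTime() - t1;

        System.out.printf("raw: %d ms, wrapper: %d ms%n",
                rawNs / 1_000_000, wrapNs / 1_000_000);
    }
}
```

In practice the split and parse dominate both loops; the extra allocation is usually lost in the noise, which is the point of the answer above.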
Upvotes: 1