Vale

Reputation: 1124

Spark Streaming: overhead of analyzing raw strings vs. instantiating dedicated classes

In my Spark application I receive comma-separated strings, which I then split and analyze as arrays. I define the indexes at the beginning of my main method, all of them static final int. The code readability is good enough, although not exactly crystal clear, and after some time it becomes troublesome to keep track of everything. A pseudocode example:

    rdd.map(receivedString -> {
        // split each record inside the transformation and address fields by constant index
        String[] data = receivedString.split(",");
        someOperation = operation(data[CONSTANT_INDEX]);
        someOtherOperation = otherOperation(data[INDEX], data[INDEX2]);
        data[RESULT_INDEX] = thirdOperation(data[THIRD_INDEX]);
    });

In another part of the code, I tried using a dedicated class to hold my data and operate on it, which made my operations much easier to track. For example:

    rdd.map(receivedString -> {
        // wrap the split record in a class and work through named accessors
        MyClass record = new MyClass(receivedString.split(","));
        someOperation = operation(record.getElement1());
        increaseOperation = record.increaseValue();
        record.setOtherValue(thirdOperation(record.getOtherValue()));
    });
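
For context, a minimal sketch of what such a wrapper class could look like; the field names and the indexes in the constructor are hypothetical, chosen only to match the pseudocode above. Note that a class used inside Spark transformations should be Serializable:

    // Hypothetical wrapper class matching the pseudocode above.
    public class MyClass implements java.io.Serializable {
        private final String element1;
        private String otherValue;
        private int counter;

        public MyClass(String[] data) {
            // placeholder indexes; substitute your actual column positions
            this.element1 = data[0];
            this.otherValue = data[1];
            this.counter = Integer.parseInt(data[2]);
        }

        public String getElement1() { return element1; }

        public String getOtherValue() { return otherValue; }

        public void setOtherValue(String otherValue) { this.otherValue = otherValue; }

        public int increaseValue() { return ++counter; }
    }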

If I receive hundreds of lines per second, 24/7, which approach should I use? Will the class-creation overhead put too much stress on my environment?
I hope this doesn't get flagged as "primarily opinion-based": which approach do you personally use, and how do you choose it?

Upvotes: 0

Views: 34

Answers (1)

maasg

Reputation: 37435

"Premature optimization is the root of all evil." - Donald Knuth

Code readability, testability and maintainability are primary goals in software engineering. Class creation will indeed incur some overhead, but that is probably negligible compared to the I/O involved in a distributed process.

So, use the approach that improves your code quality, and only "go down to the metal" if you find a performance issue. And by "find", I mean that you have used profiling techniques to determine where the performance issue actually is.
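
As a rough illustration of that last point (a throwaway sketch, not a substitute for real profiling with the Spark UI or a JVM profiler), a microbenchmark reusing the hypothetical MyClass from the question can give a first feel for the per-record cost of object creation:

    // Rough microbenchmark sketch: raw array indexing vs. wrapping each
    // record in MyClass. JIT warm-up and GC effects make this indicative
    // only; prefer a real profiler for production decisions.
    public class OverheadCheck {
        private static final int THIRD_INDEX = 1;

        public static void main(String[] args) {
            String line = "a,b,42";
            int iterations = 1_000_000;
            long sink = 0; // consume results so the JIT cannot discard the loops

            long t0 = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                String[] data = line.split(",");
                sink += data[THIRD_INDEX].length();
            }
            long arrayNanos = System.nanoTime() - t0;

            long t1 = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                MyClass record = new MyClass(line.split(","));
                sink += record.getOtherValue().length();
            }
            long classNanos = System.nanoTime() - t1;

            System.out.printf("array: %d ms, class: %d ms (sink=%d)%n",
                    arrayNanos / 1_000_000, classNanos / 1_000_000, sink);
        }
    }

In practice the String.split call tends to dominate both loops, which is exactly the answer's point: the object allocation is a small slice of the per-record work.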

Upvotes: 1
