Karada
Karada

Reputation: 13

How to optimize code without using regex?

My test assignment when hiring was to optimize this code:

someString.replaceAll("\"", "'").replaceAll("\n", "").replace("[]", "{}");

To solve this problem, I wrote this method:

public String outputValue(String input) {
   char[] inputArray = input.toCharArray();
   StringBuilder builder = new StringBuilder();
   for (int i = 0; i < inputArray.length; i++) {
      if (i != inputArray.length - 1 && inputArray[i] == '[' && inputArray[i + 1] == ']' ) {
         builder.append("{}");
         i++;
      }
      else if (inputArray[i] == '"')
         builder.append("'");
      else if (inputArray[i] == '\n' )
         continue;
      else builder.append(inputArray[i]);
   }
   return  builder.toString();
}

I had many options for solving this problem, but this seemed the most optimal. I tested the speed using the Instant class, and on any line size it showed me high results. The firm told me that in their measurements there is degradation on large lines, that this is due to an overhead for using StringBuilder and unconditional allocation of a new line at the end.

So i didn't get a job.

What solution could be even more optimized? Even using a char array, I cannot do without string allocation String.valueOf().

Upvotes: 1

Views: 126

Answers (3)

MrSmith42
MrSmith42

Reputation: 10151

Instead of using the StringBuilder you could modify the inputArray directly and use it as result (possible, because your replaces can only shorten the char-sequence)

This would be my resulting version:

 public static String outputValue(final String input) {
        final char[] dataArray = input.toCharArray();
        int destPos = 0;
        int srcPos = 0;
        final int len = dataArray.length;
        while (srcPos < len) {
            if ('"' == dataArray[srcPos]) {
                dataArray[destPos++] = '\'';
                srcPos++;
                continue;
            }
            if ('\n' == dataArray[srcPos]) {
                srcPos++;
                continue;
            }
            if ('[' == dataArray[srcPos] && srcPos < len - 1 && ']' == dataArray[srcPos + 1]) {
                dataArray[destPos++] = '{';
                dataArray[destPos++] = '}';
                srcPos += 2;
                continue;
            }
            dataArray[destPos++] = dataArray[srcPos++];
        }
        return String.copyValueOf(dataArray, 0, destPos);
    }

Thoughts about the code:

  • Memory consumption is the dataArray (and the created return-String).
  • Speed: only profiling can tell if it's faster than using a StringBuilder with a predefined size.

Speed: I made rough speed test of the original, my own and the version of rzwitserloot (with a long String dats-len: 268435456 characters output-len 251658240

At least running from within eclipse, as JUnit-test,mine is faster than both the others.

Output of my test:

date-len 268435456     
out-len  251658240  
mr smith42:   1007
rzwitserloot: 1512
Orgiginal:    1950
mr smith42:   834
rzwitserloot: 1370
Orgiginal:    1870
mr smith42:   789
rzwitserloot: 1214
Orgiginal:    1945
mr smith42:   720
rzwitserloot: 1142
Orgiginal:    1572
mr smith42:   672
rzwitserloot: 1073
Orgiginal:    1739
mr smith42:   925
rzwitserloot: 1240   
Orgiginal:    1823  
mr smith42:   834  
rzwitserloot: 1284  
Orgiginal:    1676  
mr smith42:   840  
rzwitserloot: 1230  
Orgiginal:    1621 
mr smith42:   811  
rzwitserloot: 1260  
Orgiginal:    1695  
mr smith42:   840  
rzwitserloot: 1257  
Orgiginal:    1707  
mr smith42:   868  
rzwitserloot: 1309  
Orgiginal:    1716  
mr smith42:   837  
rzwitserloot: 1262  
Orgiginal:    1683  
mr smith42:   835  
rzwitserloot: 1240  
Orgiginal:    1632  
mr smith42:   824  
rzwitserloot: 1255  
Orgiginal:    1635  
mr smith42:   844  
rzwitserloot: 1243  
....

Here my rough test: (Using a real profiling tool should lead to similar results)

 @Test
    public void test() {

        String data = "jksdhfewfnku\n[] [  {} \"  [] }\nss";
        data += data;
        data += data;
        data += data;
        data += data;
        data += data;
        data += data;
        data += data;
        data += data;
        data += data;
        data += data;
        data += data;
        data += data;
        data += data;
        data += data;
        data += data;
        data += data;
        data += data;
        data += data;
        data += data;
        data += data;
        data += data;
        data += data;
        data += data;
        String expected = "jksdhfewfnku{} [  {} '  {} }ss";
        expected += expected;
        expected += expected;
        expected += expected;
        expected += expected;
        expected += expected;
        expected += expected;
        expected += expected;
        expected += expected;
        expected += expected;
        expected += expected;
        expected += expected;
        expected += expected;
        expected += expected;
        expected += expected;
        expected += expected;
        expected += expected;
        expected += expected;
        expected += expected;
        expected += expected;
        expected += expected;
        expected += expected;
        expected += expected;
        expected += expected;

        System.err.println("date-len " + data.length());
        System.err.println("out-len  " + expected.length());

        // run it multible times to give the VM a chance for optimization
        for (int i = 20; 0 < i; i--) {
            final long t = System.currentTimeMillis();
            Assert.assertEquals(expected, T.outputValue(data));
            System.err.println("mr smith42:   " + (System.currentTimeMillis() - t));
            final long t2 = System.currentTimeMillis();
            Assert.assertEquals(expected, T.outputValue2(data));
            System.err.println("rzwitserloot: " + (System.currentTimeMillis() - t2));
            final long t3 = System.currentTimeMillis();
            Assert.assertEquals(expected, T.outputValueOrgiginal(data));
            System.err.println("Orgiginal:    " + (System.currentTimeMillis() - t3));
        }
    }

Upvotes: 1

user14410324
user14410324

Reputation:

These interview questions are more about what questions you ask the interviewer. The interviewer is the customer and you are the contractor in to solve their problem.

It's their system, you were asked to optimize. What questions are you going to ask them?

The other case is that they have no idea how to optimize that line of code and they are just running some tools someone (much more talented than they are) made to find the bottle neck.

You are there to give them free code to run through their tools. You aren't getting the job because there wasn't one to begin with. Shady? Yes. Done often in all industries and all disciplines? Yes.

Upvotes: 0

rzwitserloot
rzwitserloot

Reputation: 102812

StringBuilder builder = new StringBuilder();

StringBuilder has no idea how large the output is going to be, so you do get amortized growth cost here: SB isn't voodoo magic, it needs to guess at sizes. It'll guess 'eh, I'll give you 10 chars worth'. If you then add the 11th char, SB has to make a new char array (a larger one; I think it grows by a factor x2, might be x1.5), so it makes a 20-char array, copies over the first 10 chars, and tosses out the old array. This is not too bad (by definition this growth penalty is applied to an ever exponentially decreasing number of append operations), but...

look at the input question: The output string is as large as the input string, OR smaller (every \n in the input results in the output being 1 smaller). So, you can tell StringBuilder beforehand: Hey, just start out with this much capacity (the length of the input). Now there is no growth penalty anymore.

that this is due to an overhead for using StringBuilder

This is misleading or oversimplified; you may not have quite heard that correctly. StringBuilder is just an object, with a char array. That is overhead, yeah, but,

someString.replaceAll("\"", "'").replaceAll("\n", "").replace("[]", "{}");

that makes 2 pattern objects and (potentially) 3 strings. strings also are just an object with a big char array inside, so that's potentially more overhead and thus does not explain in the slightest why your take can be a lot slower.

However, there is another optimization you missed:

The replace/replaceAll methods don't make any string, AT ALL, if there is no need. Try it: "Hello!".replace("Foo", "Bar") returns the same reference (a == b equal).

Also, .toCharArray does require yet another full clone, also pricey. Just use charAt.

So, let's make an actual fast one:

public String outputValue(String input) {
   StringBuilder builder = null;
   int len = input.length();
   for (int i = 0; i < len; i++) {
      char c = input.charAt(i);
      if (c != '[' && c != '\n' && c != '"') {
        if (builder != null) builder.append(c);
        continue;
      }

      if (builder == null) {
        // Don't even make the builder until we determine we need it!
        builder = new StringBuilder(len);
        builder.append(input, 0, i);
      }
      if (i != leg - 1 && c == '[' && input.charAt(i + 1) == ']' ) {
         builder.append("{}");
         i++;
      } else if (c == '"') {
         builder.append('\'');
      } else if (c != '\n') {
         builder.append(c);
      }
   }
   return  builder == null ? input : builder.toString();
}

I tested the speed using the Instant class, and on any line size it showed me high results.

Not how you test performance. Instant represents system clock; it has only at best microsecond granularity (and usually not even that), and can and will change if the net time daemon changes the time on you. It can also be under smear (when a clock needs adjusting, because so much software out there assumes time can never run backwards, instead of just moving the time back, a smearing NTP will simply run time 'slower' - ticking up 750 milliseconds every second for example, until time's 'fixed').

Even nanoTime is not great, though - benchmarking is way more complicated than this. Use JMH (Java Microbenchmark Harness).

If you want to see some examples where your code does way worse than the original .replaceAll/replaceAll/replace chain, make a bunch of strings that are large and which do not contain any of these symbols (no newlines, no quotes, no brackets). You'll see a very big difference.


But, interview questions aren't neccessarily about code optimization. For example, I find these just as plausible:

  1. Change almost nothing; replace is reasonably fast. However, .replaceAll involves a regex. If this line gets executed a ton, making an explicit pattern object would be more efficient and read better (though that is in the eye of the beholder). More generally, the patterns here don't use any regex features, so maybe the interviewer just wanted you to replace replaceAll with replace and that is all. To be clear, my code above will be faster if you have a large input with newlines and quotes and brackets in it, and will not be slower if there aren't. But, not every line of code needs to be optimized to the maximum possible extent.

  2. Before digging into the code at all, the exercise tested your willingness to ask questions and look for the bigger picture first. Maybe they wanted you to ask: How often is this run? Is there a performance framework in place? What kinds of strings are usually thrown at this method? Can we perhaps look elsewhere and fix the strings 'at the source' instead of converting them every time? When I do an interview, I tend to craft technical questions that seem complete but which have some missing information once you dig down enough - if the interviewee doesn't ask, I dock them some points. It's not an instant fail - I'll tell em and see if they bother to ask and debate first before diving into programming for the next question.

Upvotes: 2

Related Questions