Parsing a line of data: split vs regex

Question

I have lines of data coming from a script which typically look like this (single line example):

1234567890;group1;varname1;133333337;prop1=val1;prop2=val2;prop3=val3

I need to break each line into Key-Value items for a Map, each item being separated by a separator string (; in the example, but it can be a custom one too). The first 4 items are static, meaning that only the value is in the line, and the keys are already known. The rest is a variable number of key-value items (0 or more key=value chunks). Please take a look at the output below first to give you an idea.

I already have two working methods to accomplish that, and both produce the same output for a given line. I have set up a test class to demonstrate the two methods at work, along with some (simple) performance analysis out of curiosity. Note that invalid input handling is minimal in the methods shown below.

String Splitting (using Apache Commons):

private static List<String> splitParsing(String dataLine, String separator) {
    List<String> output = new ArrayList<String>();
    long begin = System.nanoTime();

    String[] data = StringUtils.split(dataLine, separator);

    if (data.length >= STATIC_PROPERTIES.length) {
        // Static properties (always there).
        for (int i = 0; i < STATIC_PROPERTIES.length; i++) {
            output.add(STATIC_PROPERTIES[i] + " = " + data[i]);
        }

        // Dynamic properties (0 or more).
        for (int i = STATIC_PROPERTIES.length; i < data.length; i++) {
            String[] fragments = StringUtils.split(data[i], KEYVALUE_SEPARATOR);
            if (fragments.length == 2) {
                output.add(fragments[0] + " = " + fragments[1]);
            }
        }
    }

    long end = System.nanoTime();
    output.add("Execution time: " + (end - begin) + "ns");
    return output;
}

Regex (using JDK 1.6):

private static List<String> regexParsing(String dataLine, String separator) {
    List<String> output = new ArrayList<String>();
    long begin = System.nanoTime();

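    // Pattern.quote keeps a custom separator such as "|" or "." from being read as regex syntax.
    String quotedSeparator = Pattern.quote(separator);
    Pattern linePattern = Pattern.compile(StringUtils.replace(DATA_PATTERN_TEMPLATE, SEP, quotedSeparator));
    Pattern propertiesPattern = Pattern.compile(StringUtils.replace(PROPERTIES_PATTERN_TEMPLATE, SEP, quotedSeparator));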

    Matcher lineMatcher = linePattern.matcher(dataLine);
    if (lineMatcher.matches()) {
        // Static properties (always there).
        for (int i = 0; i < STATIC_PROPERTIES.length; i++) {
            output.add(STATIC_PROPERTIES[i] + " = " + lineMatcher.group(i + 1));
        }

        Matcher propertiesMatcher = propertiesPattern.matcher(lineMatcher.group(STATIC_PROPERTIES.length + 1));
        while (propertiesMatcher.find()) {
            output.add(propertiesMatcher.group(1) + " = " + propertiesMatcher.group(2));
        }
    }

    long end = System.nanoTime();
    output.add("Execution time: " + (end - begin) + "ns");
    return output;
}

Main method:

public static void main(String[] args) {
    String input = "1234567890;group1;varname1;133333337;prop1=val1;prop2=val2;prop3=val3";

    System.out.println("Split parsing:");
    for (String line : splitParsing(input, ";")) {
        System.out.println(line);
    }

    System.out.println();

    System.out.println("Regex parsing:");
    for (String line : regexParsing(input, ";")) {
        System.out.println(line);
    }
}

Constants:

// Common constants.
private static final String TIMESTAMP_KEY = "TMST";
private static final String GROUP_KEY = "GROUP";
private static final String VARIABLE_KEY = "VARIABLE";
private static final String VALUE_KEY = "VALUE";
private static final String KEYVALUE_SEPARATOR = "=";
private static final String[] STATIC_PROPERTIES = { TIMESTAMP_KEY, GROUP_KEY, VARIABLE_KEY, VALUE_KEY };

// Regex constants.
private static final String SEP = "{sep}";
private static final String PROPERTIES_PATTERN_TEMPLATE = SEP + "(\\w+)" + KEYVALUE_SEPARATOR + "(\\w+)";
private static final String DATA_PATTERN_TEMPLATE = "(\\d+)" + SEP + "(\\w+)" + SEP + "(\\w+)" + SEP + "(\\d+\\.?\\d*)"
        + "((?:" + PROPERTIES_PATTERN_TEMPLATE + ")*)";

Output from main method:

Split parsing:
TMST = 1234567890
GROUP = group1
VARIABLE = varname1
VALUE = 133333337
prop1 = val1
prop2 = val2
prop3 = val3
Execution time: 8695796ns

Regex parsing:
TMST = 1234567890
GROUP = group1
VARIABLE = varname1
VALUE = 133333337
prop1 = val1
prop2 = val2
prop3 = val3
Execution time: 1250787ns

Judging from the output (which I ran multiple times), it seems that the regex method is more efficient in terms of performance, even though my initial thoughts were more towards the splitting method. However, I'm not certain how representative this performance analysis is.
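A single System.nanoTime() measurement around one call also includes one-off costs such as class loading and JIT warm-up, and in the main method above the split method happens to run first. The sketch below is one rough way to get a fairer comparison: warm both parsers up, then average over many iterations with the patterns compiled once. The class and method names are hypothetical, and it uses plain JDK String.split rather than Apache Commons so that it is self-contained. A harness such as JMH would be more trustworthy than this hand-rolled loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParseTimingSketch {
    // Patterns compiled once, outside the timed code (separator fixed to ';' here).
    static final Pattern LINE = Pattern.compile(
            "(\\d+);(\\w+);(\\w+);(\\d+\\.?\\d*)((?:;\\w+=\\w+)*)");
    static final Pattern PROP = Pattern.compile(";(\\w+)=(\\w+)");

    // Hypothetical stand-in for the splitting approach (JDK only).
    static List<String> viaSplit(String line) {
        List<String> out = new ArrayList<>();
        for (String field : line.split(";")) {
            out.add(field);
        }
        return out;
    }

    // Hypothetical stand-in for the regex approach.
    static List<String> viaRegex(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = LINE.matcher(line);
        if (m.matches()) {
            for (int i = 1; i <= 4; i++) {
                out.add(m.group(i));
            }
            Matcher p = PROP.matcher(m.group(5));
            while (p.find()) {
                out.add(p.group(1) + "=" + p.group(2));
            }
        }
        return out;
    }

    // Average time per call in nanoseconds over the given number of iterations.
    static long averageNanos(Function<String, List<String>> parser, String line, int iterations) {
        long sink = 0; // consume the results so the calls cannot be optimised away
        long begin = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += parser.apply(line).size();
        }
        long elapsed = System.nanoTime() - begin;
        if (sink < 0) {
            throw new IllegalStateException("unreachable");
        }
        return elapsed / iterations;
    }

    public static void main(String[] args) {
        String input = "1234567890;group1;varname1;133333337;prop1=val1;prop2=val2;prop3=val3";
        // Warm-up runs: results are discarded.
        averageNanos(ParseTimingSketch::viaSplit, input, 20000);
        averageNanos(ParseTimingSketch::viaRegex, input, 20000);
        // Measured runs.
        System.out.println("split: " + averageNanos(ParseTimingSketch::viaSplit, input, 100000) + "ns per line");
        System.out.println("regex: " + averageNanos(ParseTimingSketch::viaRegex, input, 100000) + "ns per line");
    }
}
```

The numbers will vary by machine and JVM, so treat them as a rough indication only.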

My questions are:

Which of these two methods would be best or easier to work with for invalid input handling? (Ex.: static item missing, invalid format, etc.).
Which of these methods is less likely to produce unexpected behaviour?
Why is the regex method faster? I would have assumed the opposite since Matchers and Patterns must have a somewhat more complex logic behind them. Is my performance analysis even representative?

Alexis Leclerc · Accepted Answer

In the end, after testing and playing with it, I went with the String Splitting method for the following reasons:

Which of these two methods would be best or easier to work with for invalid input handling? (Ex.: static item missing, invalid format, etc.).

With splitting, I can easily figure out which part of the analysed string fails and log a precise and useful warning message, whereas with regexes it is much more difficult to do so.
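To illustrate the kind of message this allows, here is a minimal sketch, not the code actually used. The class name, method name and warning texts are hypothetical. It uses JDK String.split with Pattern.quote instead of Apache Commons. As far as I know, StringUtils.split(String, String) treats its second argument as a set of delimiter characters and merges adjacent separators, which would hide an empty field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class SplitDiagnosticsSketch {
    static final String[] STATIC_KEYS = { "TMST", "GROUP", "VARIABLE", "VALUE" };

    // Returns one warning per problem found; an empty list means the line looks well-formed.
    static List<String> check(String line, String separator) {
        List<String> warnings = new ArrayList<>();
        // Pattern.quote makes the separator literal; limit -1 keeps trailing empty fields.
        String[] fields = line.split(Pattern.quote(separator), -1);

        if (fields.length < STATIC_KEYS.length) {
            warnings.add("expected at least " + STATIC_KEYS.length
                    + " fields but found " + fields.length);
            return warnings;
        }
        // Static items: only the value is present, so just check that it is not empty.
        for (int i = 0; i < STATIC_KEYS.length; i++) {
            if (fields[i].isEmpty()) {
                warnings.add("static field " + STATIC_KEYS[i] + " (position " + i + ") is empty");
            }
        }
        // Dynamic items: each one should contain '=' with a non-empty key before it.
        for (int i = STATIC_KEYS.length; i < fields.length; i++) {
            if (fields[i].indexOf('=') <= 0) {
                warnings.add("field " + i + " '" + fields[i] + "' is not in key=value form");
            }
        }
        return warnings;
    }

    public static void main(String[] args) {
        for (String warning : check("1234567890;;varname1;133333337;prop1", ";")) {
            System.out.println(warning);
        }
    }
}
```

Each warning names the position and content of the offending field, which a failed matches() call on a whole-line regex cannot tell you directly.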

Which of these methods is less likely to produce unexpected behaviour?

Since the content of the parsed line does not need to be verified or validated, I found it more reliable to use splitting. With regexes, I always ended up with something too restrictive on the content or something too loose that gave unexpected results.
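As a small illustration of "too restrictive", the sketch below uses the line pattern from the question with the separator fixed to ';'. A value containing a hyphen is enough to make the whole line fail to match, while splitting still returns every field. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class RestrictiveRegexSketch {
    // Same shape as DATA_PATTERN_TEMPLATE in the question, with ';' as the separator.
    static final Pattern LINE = Pattern.compile(
            "(\\d+);(\\w+);(\\w+);(\\d+\\.?\\d*)((?:;\\w+=\\w+)*)");

    // True only if the entire line fits the pattern.
    static boolean regexAccepts(String line) {
        return LINE.matcher(line).matches();
    }

    // Number of fields obtained by splitting on the separator.
    static int splitFieldCount(String line) {
        return line.split(";").length;
    }

    public static void main(String[] args) {
        String ok = "1234567890;group1;varname1;133333337;prop1=val1";
        String hyphen = "1234567890;group1;varname1;133333337;prop1=val-1";
        System.out.println(regexAccepts(ok));        // true
        System.out.println(regexAccepts(hyphen));    // false: '-' is not matched by \w
        System.out.println(splitFieldCount(hyphen)); // 5
    }
}
```

Loosening \w+ to something like "anything but the separator" fixes this case but then accepts content the pattern was presumably meant to reject, which is the trade-off described above.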

With the splitting method, I simply split with the separator, pack every Key-Value pair, and that's it.
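Roughly, that comes down to something like the following sketch. It is an approximation rather than my exact code: the class and method names are hypothetical, it fills a LinkedHashMap instead of the List used in the test class, and it uses JDK String.split with Pattern.quote so that it runs without Apache Commons.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class SplitToMapSketch {
    static final String[] STATIC_KEYS = { "TMST", "GROUP", "VARIABLE", "VALUE" };

    // Parses one line into an insertion-ordered map; returns an empty map if static items are missing.
    static Map<String, String> parse(String line, String separator) {
        Map<String, String> result = new LinkedHashMap<>();
        // Pattern.quote makes a custom separator such as "|" literal.
        String[] fields = line.split(Pattern.quote(separator));
        if (fields.length < STATIC_KEYS.length) {
            return result;
        }
        // Static items: the keys are known in advance, the line only carries the values.
        for (int i = 0; i < STATIC_KEYS.length; i++) {
            result.put(STATIC_KEYS[i], fields[i]);
        }
        // Dynamic items: split each chunk on the first '=' and skip malformed chunks.
        for (int i = STATIC_KEYS.length; i < fields.length; i++) {
            int eq = fields[i].indexOf('=');
            if (eq > 0) {
                result.put(fields[i].substring(0, eq), fields[i].substring(eq + 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "1234567890;group1;varname1;133333337;prop1=val1;prop2=val2;prop3=val3";
        for (Map.Entry<String, String> entry : parse(input, ";").entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

Splitting each chunk on the first '=' only means a value may itself contain '=', which is one of the cases where I found the regex either too strict or too loose.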

Why is the regex method faster? I would have assumed the opposite since Matchers and Patterns must have a somewhat more complex logic behind them. Is my performance analysis even representative?

Thanks to Stye's answer for that part, especially with the interesting reference to the Split/Match/indexOf experiment.
