Reputation: 1313
I have lines of data coming from a script which typically look like this (single line example):
1234567890;group1;varname1;133333337;prop1=val1;prop2=val2;prop3=val3
I need to break each line into key-value items for a Map, with the items separated by a separator string (; in the example, but it can be a custom one too). The first 4 items are static, meaning that only the value is in the line and the keys are already known. The rest is a variable number of key-value items (0 or more key=value chunks). Please look at the output below first to get an idea.
I already have two working methods that accomplish this, and both produce the same output for the same line. I have set up a test class to demonstrate the two methods at work, along with some (simple) performance analysis out of curiosity. Note that invalid-input handling is minimal in the methods shown below.
String Splitting (using Apache Commons):
private static List<String> splitParsing(String dataLine, String separator) {
    List<String> output = new ArrayList<String>();
    long begin = System.nanoTime();
    // Note: StringUtils.split treats 'separator' as a set of delimiter characters,
    // so a multi-character separator is not matched as a whole string here.
    String[] data = StringUtils.split(dataLine, separator);
    if (data.length >= STATIC_PROPERTIES.length) {
        // Static properties (always there).
        for (int i = 0; i < STATIC_PROPERTIES.length; i++) {
            output.add(STATIC_PROPERTIES[i] + " = " + data[i]);
        }
        // Dynamic properties (0 or more).
        for (int i = STATIC_PROPERTIES.length; i < data.length; i++) {
            String[] fragments = StringUtils.split(data[i], KEYVALUE_SEPARATOR);
            if (fragments.length == 2) {
                output.add(fragments[0] + " = " + fragments[1]);
            }
        }
    }
    long end = System.nanoTime();
    output.add("Execution time: " + (end - begin) + "ns");
    return output;
}
Regex (using JDK 1.6):
private static List<String> regexParsing(String dataLine, String separator) {
    List<String> output = new ArrayList<String>();
    long begin = System.nanoTime();
    // Pattern.quote keeps a custom separator such as "|" or "." from being read as regex syntax.
    String quotedSeparator = Pattern.quote(separator);
    Pattern linePattern = Pattern.compile(StringUtils.replace(DATA_PATTERN_TEMPLATE, SEP, quotedSeparator));
    Pattern propertiesPattern = Pattern.compile(StringUtils.replace(PROPERTIES_PATTERN_TEMPLATE, SEP, quotedSeparator));
    Matcher lineMatcher = linePattern.matcher(dataLine);
    if (lineMatcher.matches()) {
        // Static properties (always there).
        for (int i = 0; i < STATIC_PROPERTIES.length; i++) {
            output.add(STATIC_PROPERTIES[i] + " = " + lineMatcher.group(i + 1));
        }
        // Dynamic properties (0 or more), taken from the last capturing group.
        Matcher propertiesMatcher = propertiesPattern.matcher(lineMatcher.group(STATIC_PROPERTIES.length + 1));
        while (propertiesMatcher.find()) {
            output.add(propertiesMatcher.group(1) + " = " + propertiesMatcher.group(2));
        }
    }
    long end = System.nanoTime();
    output.add("Execution time: " + (end - begin) + "ns");
    return output;
}
Main method:
public static void main(String[] args) {
    String input = "1234567890;group1;varname1;133333337;prop1=val1;prop2=val2;prop3=val3";
    System.out.println("Split parsing:");
    for (String line : splitParsing(input, ";")) {
        System.out.println(line);
    }
    System.out.println();
    System.out.println("Regex parsing:");
    for (String line : regexParsing(input, ";")) {
        System.out.println(line);
    }
}
Constants:
// Common constants.
private static final String TIMESTAMP_KEY = "TMST";
private static final String GROUP_KEY = "GROUP";
private static final String VARIABLE_KEY = "VARIABLE";
private static final String VALUE_KEY = "VALUE";
private static final String KEYVALUE_SEPARATOR = "=";
private static final String[] STATIC_PROPERTIES = { TIMESTAMP_KEY, GROUP_KEY, VARIABLE_KEY, VALUE_KEY };
// Regex constants.
private static final String SEP = "{sep}";
private static final String PROPERTIES_PATTERN_TEMPLATE = SEP + "(\\w+)" + KEYVALUE_SEPARATOR + "(\\w+)";
private static final String DATA_PATTERN_TEMPLATE = "(\\d+)" + SEP + "(\\w+)" + SEP + "(\\w+)" + SEP + "(\\d+\\.?\\d*)"
        + "((?:" + PROPERTIES_PATTERN_TEMPLATE + ")*)";
Output from main method:
Split parsing:
TMST = 1234567890
GROUP = group1
VARIABLE = varname1
VALUE = 133333337
prop1 = val1
prop2 = val2
prop3 = val3
Execution time: 8695796ns
Regex parsing:
TMST = 1234567890
GROUP = group1
VARIABLE = varname1
VALUE = 133333337
prop1 = val1
prop2 = val2
prop3 = val3
Execution time: 1250787ns
Judging from the output (which I ran multiple times), it seems that the regex method is more efficient in terms of performance, even though my initial thoughts were more towards the splitting method. However, I'm not certain how representative this performance analysis is.
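A single timed call like the ones above mostly measures one-off costs (class loading, JIT warm-up, and, in the regex method, the two Pattern.compile calls inside the timed region), so whichever method runs first tends to look slower. A rough sketch of a fairer comparison could warm both paths up first and then average over many iterations. Everything here is an assumption for illustration: the class name ParseBenchmark, the iteration counts, the hard-coded ";" separator, and JDK String.split standing in for StringUtils.split. A harness such as JMH would be more trustworthy than this hand-rolled loop.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical micro-benchmark sketch; not the original test class.
final class ParseBenchmark {
    // Compiled once, outside the timed region.
    static final Pattern LINE = Pattern.compile(
            "(\\d+);(\\w+);(\\w+);(\\d+\\.?\\d*)((?:;\\w+=\\w+)*)");
    static final Pattern PROPERTY = Pattern.compile(";(\\w+)=(\\w+)");

    // Accumulates results so the JIT cannot discard the parsing work.
    static long sink;

    // Counts key-value items via splitting (JDK String.split as a stand-in).
    static int countWithSplit(String line) {
        String[] data = line.split(";");
        if (data.length < 4) {
            return 0;
        }
        int count = 4;
        for (int i = 4; i < data.length; i++) {
            if (data[i].split("=").length == 2) {
                count++;
            }
        }
        return count;
    }

    // Counts key-value items via the precompiled patterns.
    static int countWithRegex(String line) {
        Matcher lineMatcher = LINE.matcher(line);
        if (!lineMatcher.matches()) {
            return 0;
        }
        int count = 4;
        Matcher propertyMatcher = PROPERTY.matcher(lineMatcher.group(5));
        while (propertyMatcher.find()) {
            count++;
        }
        return count;
    }

    // Average wall-clock nanoseconds per parsed line over 'iterations' runs.
    static long averageNanos(boolean useRegex, String line, int iterations) {
        long begin = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += useRegex ? countWithRegex(line) : countWithSplit(line);
        }
        return (System.nanoTime() - begin) / iterations;
    }

    public static void main(String[] args) {
        String input = "1234567890;group1;varname1;133333337;prop1=val1;prop2=val2;prop3=val3";
        // Warm-up passes: their timings are thrown away.
        averageNanos(false, input, 100000);
        averageNanos(true, input, 100000);
        System.out.println("split: " + averageNanos(false, input, 1000000) + " ns/line");
        System.out.println("regex: " + averageNanos(true, input, 1000000) + " ns/line");
    }
}
```

The numbers will still vary between machines and JVM versions, so they are only a hint about relative cost, not a verdict.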
My questions are:
Matchers and Patterns must have a somewhat more complex logic behind them. Is my performance analysis even representative?
Upvotes: 1
Views: 1486
Reputation: 1313
In the end, after testing and playing with it, I went with the String Splitting method for the following reasons:
With splitting, I can easily figure out which part of the analysed string fails and log a precise and useful warning message, whereas with regexes it is much more difficult to do so.
Since the content of the parsed line does not need to be verified or validated, I found it more reliable to use splitting. With regexes, I always ended up with something too restrictive on the content or something too loose that gave unexpected results.
With the splitting method, I simply split with the separator, pack every Key-Value pair, and that's it.
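For illustration, the splitting approach with per-field warnings might look roughly like the sketch below. It is not my production code: the class name SplitLineParser, the warnings list parameter, and the warning texts are made up for this example, and it uses JDK String.split with Pattern.quote rather than Apache Commons so that it stands alone. Unlike the test method in the question, it cuts each pair at the first "=" only, so a value may itself contain "=".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch of split-based parsing into a Map, with precise warnings.
final class SplitLineParser {
    static final String[] STATIC_KEYS = { "TMST", "GROUP", "VARIABLE", "VALUE" };

    static Map<String, String> parse(String line, String separator, List<String> warnings) {
        Map<String, String> result = new LinkedHashMap<String, String>();
        // Pattern.quote makes a custom separator such as "|" literal.
        String[] data = line.split(Pattern.quote(separator));
        if (data.length < STATIC_KEYS.length) {
            warnings.add("Expected at least " + STATIC_KEYS.length
                    + " fields but found " + data.length);
            return result;
        }
        // Static properties: keys are known, values come from the line.
        for (int i = 0; i < STATIC_KEYS.length; i++) {
            result.put(STATIC_KEYS[i], data[i]);
        }
        // Dynamic properties: cut each chunk at its first '='.
        for (int i = STATIC_KEYS.length; i < data.length; i++) {
            int eq = data[i].indexOf('=');
            if (eq <= 0) {
                warnings.add("Field " + (i + 1) + " is not a key=value pair: '" + data[i] + "'");
                continue;
            }
            result.put(data[i].substring(0, eq), data[i].substring(eq + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> warnings = new ArrayList<String>();
        String input = "1234567890;group1;varname1;133333337;prop1=val1;broken;prop3=val3";
        System.out.println(parse(input, ";", warnings));
        System.out.println(warnings);
    }
}
```

The point of the design is that a malformed chunk produces a warning naming the exact field, while the rest of the line is still parsed.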
Matchers and Patterns must have a somewhat more complex logic behind them. Is my performance analysis even representative?
Thanks to Stye's answer for that part, especially with the interesting reference to the Split/Match/indexOf experiment.
Upvotes: 0
Reputation: 110
I'll try to tackle your questions:
I'd go with the Matcher method, because you can simply iterate over your declared array of Patterns and use Matcher#usePattern(Pattern) on each. I find it clean and clear: all the desired regexes are packed in one place and run in a quick for-each loop.
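To make the usePattern idea concrete, here is a minimal sketch, assuming one pattern per static field and a plain literal separator. The class name UsePatternParser, the per-field patterns, and the error messages are my own illustration and are not taken from the question's code. It reuses a single Matcher, swaps the pattern for each field with usePattern, and moves the region forward so that lookingAt() starts where the previous field ended.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: one Matcher, one Pattern per static field.
final class UsePatternParser {
    static final String[] KEYS = { "TMST", "GROUP", "VARIABLE", "VALUE" };
    static final Pattern[] FIELD_PATTERNS = {
        Pattern.compile("\\d+"),
        Pattern.compile("\\w+"),
        Pattern.compile("\\w+"),
        Pattern.compile("\\d+\\.?\\d*")
    };

    static List<String> parseStaticFields(String line, String separator) {
        List<String> output = new ArrayList<String>();
        Matcher matcher = FIELD_PATTERNS[0].matcher(line);
        int pos = 0;
        for (int i = 0; i < FIELD_PATTERNS.length; i++) {
            // Swap the pattern, then restrict matching to the unread part of the line.
            matcher.usePattern(FIELD_PATTERNS[i]).region(pos, line.length());
            if (!matcher.lookingAt()) {
                output.add(KEYS[i] + " is invalid at offset " + pos);
                return output;
            }
            output.add(KEYS[i] + " = " + matcher.group());
            pos = matcher.end();
            if (pos < line.length()) {
                // Anything left after a field must start with the separator.
                if (!line.startsWith(separator, pos)) {
                    output.add("Unexpected text after " + KEYS[i] + " at offset " + pos);
                    return output;
                }
                pos += separator.length();
            }
        }
        return output;
    }

    public static void main(String[] args) {
        String input = "1234567890;group1;varname1;133333337;prop1=val1";
        for (String item : parseStaticFields(input, ";")) {
            System.out.println(item);
        }
    }
}
```

A side effect of going field by field is that a failure can be reported against a specific key, which a single all-in-one line pattern does not give you directly.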
You're using the Apache Commons implementation of split. Per their documentation, it uses a specialized implementation of StringTokenizer, which, as shown by experiment, is slower than String#split(String regex) (which uses String#indexOf()) and also slower than the Matcher and Pattern approach.
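For reference, the indexOf style of splitting mentioned above can be sketched by hand as below. This is only an illustration of the technique, not the JDK's or Apache Commons' actual code, and the class name IndexOfSplitter is made up. Two behaviours to be aware of: it matches the separator as a whole string, and it keeps empty fields between adjacent separators, whereas StringUtils.split treats its separator argument as a set of characters and collapses adjacent separators.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of splitting with String#indexOf only (no regex).
final class IndexOfSplitter {
    static List<String> split(String line, String separator) {
        if (separator.isEmpty()) {
            // An empty separator would never advance the search position.
            throw new IllegalArgumentException("separator must not be empty");
        }
        List<String> parts = new ArrayList<String>();
        int start = 0;
        int index;
        while ((index = line.indexOf(separator, start)) >= 0) {
            parts.add(line.substring(start, index));
            start = index + separator.length();
        }
        // Whatever follows the last separator is the final field.
        parts.add(line.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("1234567890;group1;varname1;133333337;prop1=val1", ";"));
    }
}
```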
Generic question, but I'd go with the Apache Commons approach. One safety advantage is that it does null checks for you. Quoting the description of the StringUtils class: "Operations on String that are null safe." (from the StringUtils documentation, linked in the answer to the 2nd question). Other than that, it all depends on you :)
Upvotes: 1