Reputation: 902
Basically, this post is a challenge. I've been trying to optimize an HTML escape function today with moderate success. But I know there are some serious Java hackers out there who can probably do this way better than me and I'd love to learn.
I've been profiling my Java web app and have found that a major hotspot was our String escaping function. We currently use Apache Commons Lang for this task, calling StringEscapeUtils.escapeHtml(). I assumed since it is so widely used it would be reasonably fast, but even my most naive implementation was significantly faster.
Here's the benchmark code I used along with my Naive implementation. It tests a strings of varying length, some containing nothing but plain text and some containing HTML that needs escaping.
public class HTMLEscapeBenchmark {
public static String escapeHtml(String text) {
if (text == null) return null;
StringBuilder sb = new StringBuilder();
for (int i = 0; i < text.length(); i++) {
char c = text.charAt(i);
if (c == '&') {
sb.append("&");
} else if (c == '\'') {
sb.append("'");
} else if (c == '"') {
sb.append(""");
} else if (c == '<') {
sb.append("<");
} else if (c == '>') {
sb.append(">");
} else {
sb.append(c);
}
}
return sb.toString();
}
/*
public static String escapeHtml(String text) {
if (text == null) return null;
return StringEscapeUtils.escapeHtml(text);
}
*/
public static void main(String[] args) {
final int RUNS = 5;
final int ITERATIONS = 1000000;
// Standard lorem ipsum text.
String loremIpsum = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut " +
"labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut " +
"aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum " +
"dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia " +
"deserunt mollit anim id est laborum. ";
while (loremIpsum.length() < 1000) loremIpsum += loremIpsum;
// Add some characters that need HTML escaping. Bold every 2 and 3 letter word, quote every 5 letter word.
String loremIpsumHtml = loremIpsum.replaceAll("[A-Za-z]{2}]", "<b>$0</b>").replaceAll("[A-Za-z]{5}", "\"$0\"");
System.out.print("\nNormal-10");
String text = loremIpsum.substring(0, 10);
for (int run = 1; run <= RUNS; run++) {
long start = System.nanoTime();
for (int i = 0; i < ITERATIONS; i++) {
escapeHtml(text);
}
System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
}
System.out.print("\nNormal-100");
text = loremIpsum.substring(0, 100);
for (int run = 1; run <= RUNS; run++) {
long start = System.nanoTime();
for (int i = 0; i < ITERATIONS; i++) {
escapeHtml(text);
}
System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
}
System.out.print("\nNormal-1000");
text = loremIpsum.substring(0, 1000);
for (int run = 1; run <= RUNS; run++) {
long start = System.nanoTime();
for (int i = 0; i < ITERATIONS; i++) {
escapeHtml(text);
}
System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
}
System.out.print("\nHtml-10");
text = loremIpsumHtml.substring(0, 10);
for (int run = 1; run <= RUNS; run++) {
long start = System.nanoTime();
for (int i = 0; i < ITERATIONS; i++) {
escapeHtml(text);
}
System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
}
System.out.print("\nHtml-100");
text = loremIpsumHtml.substring(0, 100);
for (int run = 1; run <= RUNS; run++) {
long start = System.nanoTime();
for (int i = 0; i < ITERATIONS; i++) {
escapeHtml(text);
}
System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
}
System.out.print("\nHtml-1000");
text = loremIpsumHtml.substring(0, 1000);
for (int run = 1; run <= RUNS; run++) {
long start = System.nanoTime();
for (int i = 0; i < ITERATIONS; i++) {
escapeHtml(text);
}
System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
}
}
}
On my two year old MacBook pro, I get the following results.
Commons Lang StringEscapeUtils.escapeHtml
Normal-10 0.439 0.357 0.351 0.343 0.342
Normal-100 2.244 0.934 0.930 0.932 0.931
Normal-1000 8.993 9.020 9.007 9.043 9.052
Html-10 0.270 0.259 0.258 0.258 0.257
Html-100 1.769 1.753 1.765 1.754 1.759
Html-1000 17.313 17.479 17.347 17.266 17.246
Naive Implementation
Normal-10 0.111 0.091 0.086 0.084 0.088
Normal-100 0.636 0.627 0.626 0.626 0.627
Normal-1000 5.740 5.755 5.721 5.728 5.720
Html-10 0.145 0.138 0.138 0.138 0.138
Html-100 0.899 0.901 0.896 0.901 0.900
Html-1000 8.249 8.288 8.272 8.262 8.284
I'll post my own best attempt at optimization as an answer. So, my question is, can you do better? What is the fastest possible method for escaping HTML?
Upvotes: 1
Views: 1414
Reputation: 719259
I assumed since it is so widely used it would be reasonably fast, but even my most naive implementation was significantly faster.
If you look at the source code of the Apache version (for example) you will see that it is dealing with a number of cases that your version ignores:
In short, it is slower because it is more general.
Upvotes: 2
Reputation: 902
Here's my best attempt at optimizing it. I optimized for what I hope is the common case of plain text strings, but I wasn't able to make it much better for strings with HTML entities.
public static String escapeHtml(String value) {
if (value == null) return null;
int length = value.length();
String encoded;
for (int i = 0; i < length; i++) {
char c = value.charAt(i);
if (c <= 62 && (encoded = getHtmlEntity(c)) != null) {
// We found a character to encode, so we need to start from here and buffer the encoded string.
StringBuilder sb = new StringBuilder((int) (length * 1.25));
sb.append(value.substring(0, i));
sb.append(encoded);
i++;
for (; i < length; i++) {
c = value.charAt(i);
if (c <= 62 && (encoded = getHtmlEntity(c)) != null) {
sb.append(encoded);
} else {
sb.append(c);
}
}
value = sb.toString();
break;
}
}
return value;
}
private static String getHtmlEntity(char c) {
switch (c) {
case '&': return "&";
case '\'': return "'";
case '"': return """;
case '<': return "<";
case '>': return ">";
default: return null;
}
}
Normal-10 0.021 0.023 0.011 0.012 0.011
Normal-100 0.074 0.074 0.073 0.074 0.074
Normal-1000 0.671 0.678 0.675 0.675 0.680
Html-10 0.222 0.152 0.153 0.153 0.154
Html-100 0.739 0.715 0.718 0.724 0.706
Html-1000 6.812 6.828 6.802 6.802 6.806
Upvotes: 3