Reputation: 1501

Java escape HTML - string replace slow?

I have a Java application that makes heavy use of a large file: it reads the file, processes it, and passes the result on to SolrEmbeddedServer (http://lucene.apache.org/solr/).

One of the functions does basic HTML escaping:

private String htmlEscape(String input)
{
    return input.replace("&", "&amp;").replace(">", "&gt;").replace("<", "&lt;")
        .replace("'", "&apos;").replaceAll("\"", "&quot;");
}

Profiling the application shows that the program spends roughly 58% of its time in this function: 47% in replace and 11% in replaceAll.

Now, is the Java replace that slow, or am I on the right path and should I consider the program efficient enough to have its bottleneck in Java and not in my code? (Or am I replacing wrong?)

Thanks in advance!

Upvotes: 2

Answers (8)

Stephan

Reputation: 43053

For the casual reader, there is a new player in the HTML escaping field: unbescape.

An unescape operation on HTML code can be done like this:

final String unescapedText = HtmlEscape.unescapeHtml(escapedText);

Upvotes: 0

Pongvech Vechprasit

Reputation: 11

It's much easier and more standard to use Apache Commons Lang (http://commons.apache.org/lang/).

Upvotes: 1

amarillion

Reputation: 24937

This is certainly not the most efficient way to do a lot of replacements. Since strings are immutable, each .replace() call that finds a match constructs a new String object. For the example you give, each call to this function can create up to five String objects: four temporary intermediates plus the result.
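To make those copies visible, here is a minimal sketch that unrolls the chain from the question so every intermediate result has a name. The class name ReplaceCopies and the method name chained are made up for illustration, and the sketch uses replace for the last step where the question uses replaceAll.

```java
class ReplaceCopies {
    /** The question's chain, unrolled so each intermediate String is named. */
    static String chained(String input) {
        String s1 = input.replace("&", "&amp;"); // full scan of input
        String s2 = s1.replace(">", "&gt;");     // full scan of s1
        String s3 = s2.replace("<", "&lt;");     // full scan of s2
        String s4 = s3.replace("'", "&apos;");   // full scan of s3
        return s4.replace("\"", "&quot;");       // full scan of s4
    }

    public static void main(String[] args) {
        String input = "a & b";
        String replaced = input.replace("&", "&amp;");
        System.out.println(input);    // the original String is not modified
        System.out.println(replaced); // the replacement lives in a separate String
        System.out.println(chained("<a href='x'>T&C</a>"));
    }
}
```

Note that "&" has to be replaced first: the later steps introduce new ampersands, and replacing "&" afterwards would escape them a second time.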

Considering the example you give, the simplest solution is to use an existing library function for HTML entity encoding. StringEscapeUtils from Apache Commons Lang is one option. Another one is HTMLEntities.

Upvotes: 3

matt b

Reputation: 139981

Each call to replace returns a new String. Each time you call this function, you are essentially creating four copies of Strings which are going to be immediately discarded. If input is large enough, this can be wasteful.

I would suggest revising your algorithm so that instead of performing N replace operations (each of which needs to scan the String), you only scan the string once:

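// Sketch: scan the input once (needs java.util.HashMap and java.util.Map)
private static final Map<Character, String> replacements = new HashMap<>();
static {
    replacements.put('&', "&amp;");
    replacements.put('>', "&gt;");
    // ...
}

private String htmlEscape(String input) {
    StringBuilder sb = new StringBuilder(input.length());
    for (char c : input.toCharArray()) {
        String replacement = replacements.get(c);
        if (replacement != null) {
            sb.append(replacement); // character has an entity: append it
        } else {
            sb.append(c);           // ordinary character: copy it as-is
        }
    }
    return sb.toString();
}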

Upvotes: 1

Tom Hawtin - tackline

Reputation: 147164

The general algorithm for String.replace is a little complicated, but it shouldn't be that bad. Looking at the code, it is in fact implemented using regex, so it won't be fast - ick.

Obviously, you can write much faster code by iterating through character by character. Possibly working out the exact length first.
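As a rough sketch of that idea (not taken from any library), the version below makes one pass to compute the exact output length and a second pass to fill a buffer of exactly that size. The class name ExactLengthEscaper and the helper entityFor are hypothetical, and the set of entities is the one from the question.

```java
class ExactLengthEscaper {
    /** Returns the entity for c, or null if c needs no escaping. */
    private static String entityFor(char c) {
        switch (c) {
            case '&':  return "&amp;";
            case '<':  return "&lt;";
            case '>':  return "&gt;";
            case '\'': return "&apos;";
            case '"':  return "&quot;";
            default:   return null;
        }
    }

    static String htmlEscape(String input) {
        // Pass 1: work out the exact length of the escaped output.
        int length = 0;
        for (int i = 0; i < input.length(); i++) {
            String entity = entityFor(input.charAt(i));
            length += (entity == null) ? 1 : entity.length();
        }
        if (length == input.length()) {
            return input; // every entity is longer than 1 char, so nothing needs escaping
        }

        // Pass 2: fill a buffer of exactly that size.
        char[] out = new char[length];
        int pos = 0;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String entity = entityFor(c);
            if (entity == null) {
                out[pos++] = c;
            } else {
                entity.getChars(0, entity.length(), out, pos);
                pos += entity.length();
            }
        }
        return new String(out);
    }

    public static void main(String[] args) {
        System.out.println(htmlEscape("if (a < b && c > d) say \"hi\""));
    }
}
```

Whether the extra pass pays off depends on the input; it avoids any buffer growth, and input that needs no escaping is returned without allocating anything.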

You might want to consider how characters outside of [ -~] (printable ASCII) are handled. You might also want to use a library which has already implemented the functionality.
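One possible policy for characters outside that range, sketched below, is to emit them as decimal numeric character references. This is an assumption about what you might want, not a recommendation: the class name AsciiSafeEscaper is made up, and passing tab, line feed and carriage return through unchanged is an arbitrary choice. The loop walks code points rather than chars so that a surrogate pair becomes a single reference.

```java
class AsciiSafeEscaper {
    static String htmlEscape(String input) {
        StringBuilder sb = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp); // 2 for a surrogate pair, otherwise 1
            switch (cp) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '\'': sb.append("&apos;"); break;
                case '"':  sb.append("&quot;"); break;
                case '\t':
                case '\n':
                case '\r':
                    sb.append((char) cp);       // assumed policy: keep common whitespace
                    break;
                default:
                    if (cp >= ' ' && cp <= '~') {
                        sb.append((char) cp);   // printable ASCII: copy as-is
                    } else {
                        // everything else: decimal numeric character reference
                        sb.append("&#").append(cp).append(';');
                    }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(htmlEscape("caf\u00e9 <b>"));
    }
}
```

If the output is served as UTF-8, leaving non-ASCII characters untouched is an equally reasonable choice; the point is only that the decision should be made deliberately.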

Upvotes: 0

Frederik

Reputation: 14556

Your approach with multiple replace methods could be slow.

Look at Apache Commons Lang's StringEscapeUtils for a speedy implementation of escaping HTML entities.

Upvotes: 0

skaffman

Reputation: 403551

Apache Commons Lang has a very efficient escapeHtml method in its StringEscapeUtils class.

It's fairly smart about it, and doesn't use string replacement in the way you describe, but instead iterates through the characters, replacing characters with appropriate entities as it finds them.
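To illustrate that style of loop, here is a minimal sketch. It is not the Commons Lang source; the class name LazyEscaper is hypothetical and the entity set is the one from the question. It scans the input once and only allocates a StringBuilder when the first character that needs escaping turns up.

```java
class LazyEscaper {
    static String htmlEscape(String input) {
        StringBuilder sb = null; // created only when the first special character is found
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String entity;
            switch (c) {
                case '&':  entity = "&amp;";  break;
                case '<':  entity = "&lt;";   break;
                case '>':  entity = "&gt;";   break;
                case '\'': entity = "&apos;"; break;
                case '"':  entity = "&quot;"; break;
                default:   entity = null;
            }
            if (entity != null) {
                if (sb == null) {
                    sb = new StringBuilder(input.length() + 16);
                    sb.append(input, 0, i); // copy the untouched prefix in one go
                }
                sb.append(entity);
            } else if (sb != null) {
                sb.append(c);
            }
        }
        // No special characters found: return the input itself, nothing was allocated.
        return (sb == null) ? input : sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(htmlEscape("<p class=\"x\">Tom & Jerry</p>"));
    }
}
```

For input that mostly contains no special characters, this returns the original String without creating any new objects.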

I don't have any benchmarks handy, but if this stuff is on the critical path of your code, you would be wise to use this off-the-shelf, faster solution.

Upvotes: 1

Bozho

Reputation: 597244

For html escaping you can use StringEscapeUtils.escapeHtml(input) from commons-lang. It is supposedly implemented in a more efficient way there.

Upvotes: 8
