java replace HTML_Escapecodes

Question

I need to develop a new method that replaces all umlauts (ä, ö, ü) in an input string with the corresponding HTML escape codes, and it has to be fast. According to statistics, only 5% of all input strings contain umlauts. Since the method is expected to be used extensively, any unnecessary instantiation should be avoided. Could someone show me a way to do it?

Ghostkeeper · Accepted Answer

These are the HTML escape codes. Additionally, HTML features arbitrary escaping with numeric codes of the format &#NNNN; (decimal) and equivalently &#xHHHH; (hexadecimal).

A simple string-replace is not going to be efficient with so many strings to replace. I suggest you split the string by entity matches, such as this:

String[] parts = str.split("&([A-Za-z]+|#[0-9]+|#[xX][A-Fa-f0-9]+);",-1); //Limit -1 keeps a trailing empty part, so an entity at the very end is not lost.
if(parts.length <= 1) return str; //No matched entities.
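
One detail of split is worth knowing here: without a negative limit it silently drops trailing empty strings, so an entity at the very end of the input produces no final part. The sketch below shows the difference. It uses a simplified pattern that only matches named entities, and the class and method names are made up for the illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names and the simplified pattern are assumptions.
final class EntitySplitDemo {
    // Simplified: matches named entities such as &auml; only.
    private static final Pattern NAMED_ENTITY = Pattern.compile("&[A-Za-z]+;");

    private EntitySplitDemo() {}

    // Default behaviour: trailing empty strings are removed from the result.
    static String[] splitDefault(String s) {
        return NAMED_ENTITY.split(s);
    }

    // A negative limit keeps trailing empty strings in the result.
    static String[] splitKeepTrailing(String s) {
        return NAMED_ENTITY.split(s, -1);
    }
}
```

For the input "a&auml;b&ouml;", splitDefault returns two parts ("a" and "b"), while splitKeepTrailing returns three ("a", "b" and an empty string). The third part is what lets a rebuilding loop notice the final entity.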

Then you can re-build the string with the replaced parts inserted.

StringBuilder result = new StringBuilder(str.length());
result.append(parts[0]); //First part always exists.
int pos = parts[0].length() + 1; //Skip past the first part and the ampersand.
for(int i = 1;i < parts.length;i++) {
    String entityName = str.substring(pos,str.indexOf(';',pos));
    if(entityName.matches("#[xX][A-Fa-f0-9]{1,4}")) {
        result.append((char)Integer.parseInt(entityName.substring(2),16)); //Hexadecimal reference.
    } else if(entityName.matches("#[0-9]{1,5}") && Integer.parseInt(entityName.substring(1)) <= 0xFFFF) {
        result.append((char)Integer.parseInt(entityName.substring(1))); //Decimal reference. parseInt does not treat leading zeros as octal.
    } else {
        switch(entityName) {
            case "euml": result.append('ë'); break;
            case "auml": result.append('ä'); break;
            ...
            default: result.append("&" + entityName + ";"); //Unknown entity. Give the original string.
        }
    }
    result.append(parts[i]); //Append the text after the entity.
    pos += entityName.length() + parts[i].length() + 2; //Skip past the entity name, the semicolon, the following part and the next ampersand.
}
return result.toString();

Rather than copy-pasting this code, type it into your own project by hand. This gives you the opportunity to look at how the code actually works. I didn't run this code myself, so I can't guarantee that it is correct. It can also be made slightly more efficient by pre-compiling the regular expressions.
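
The pre-compilation mentioned above could look roughly like the sketch below. It is not the snippet above verbatim: it swaps the split-based approach for a Matcher loop over one pre-compiled Pattern, and it adds an indexOf('&') pre-check so that strings without any ampersand are returned without allocating anything. The class name EntityDecoder, the method name decode and the short entity table are assumptions made for the illustration. The characters are written as Unicode escapes only to keep the source ASCII.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; class name, method name and entity table are illustrative.
final class EntityDecoder {
    // Compiled once and shared; Pattern objects are immutable and thread-safe.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]{1,7}|#[xX][0-9A-Fa-f]{1,6}|[A-Za-z]+);");

    private EntityDecoder() {}

    static String decode(String str) {
        // Fast path: without an ampersand there can be no entity.
        if (str.indexOf('&') < 0) return str;
        Matcher m = ENTITY.matcher(str);
        if (!m.find()) return str; // Ampersand present, but no entity.

        StringBuilder result = new StringBuilder(str.length());
        int last = 0; // End index of the previously handled entity.
        do {
            result.append(str, last, m.start()); // Text before this entity.
            String name = m.group(1);
            if (name.charAt(0) == '#') {
                boolean hex = name.charAt(1) == 'x' || name.charAt(1) == 'X';
                int code = hex ? Integer.parseInt(name.substring(2), 16)
                               : Integer.parseInt(name.substring(1));
                if (Character.isValidCodePoint(code)) {
                    result.appendCodePoint(code);
                } else {
                    result.append(m.group()); // Out of range: keep the original text.
                }
            } else {
                switch (name) {
                    case "auml": result.append('\u00E4'); break; // a-umlaut
                    case "ouml": result.append('\u00F6'); break; // o-umlaut
                    case "uuml": result.append('\u00FC'); break; // u-umlaut
                    case "euml": result.append('\u00EB'); break; // e-diaeresis
                    default: result.append(m.group()); // Unknown entity: keep it as written.
                }
            }
            last = m.end();
        } while (m.find());
        result.append(str, last, str.length()); // Text after the last entity.
        return result.toString();
    }
}
```

With this sketch, EntityDecoder.decode("K&auml;se") yields "Käse", and an input with no entity comes back as the very same String instance. The entity table would need to be extended with whatever names you actually expect.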
