sasjaq

Reputation: 771

HTML/URL decode of a string that was encoded multiple times

We have a string that is read from a web page. Because browsers tolerate unencoded special characters (e.g. an ampersand), some pages encode them and some do not. So there is a good chance that some of our stored data was encoded once and some of it multiple times.

Is there a clean way to make sure my string is fully decoded, no matter how many times it was encoded?

Here is what we are using now:

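public static string HtmlDecode(this string input)
{
    // Decode once, then keep decoding until a pass changes nothing.
    var temp = HttpUtility.HtmlDecode(input);
    while (temp != input)
    {
        input = temp;
        temp = HttpUtility.HtmlDecode(input);
    }
    // input is now stable under decoding (or was null to begin with).
    return input;
}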

We do the same with UrlDecode.
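For reference, the UrlDecode counterpart might look roughly like the sketch below. The class name, the method name `UrlDecodeRepeatedly` and the `maxPasses` cap are hypothetical, and it uses `WebUtility.UrlDecode` from System.Net rather than `HttpUtility` so that it compiles without a System.Web reference. One caveat: `UrlDecode` also turns `+` into a space, so a value containing `%2B` would first become `+` and then a space on the next pass, which makes the repeated approach riskier for URLs than for HTML.

```csharp
using System;
using System.Net;

public static class UrlDecodeExtensions
{
    // Hypothetical helper: decodes until a pass no longer changes the string.
    // maxPasses is an assumed safety cap, not something from the original code.
    public static string UrlDecodeRepeatedly(this string input, int maxPasses = 10)
    {
        if (input == null)
        {
            return null;
        }

        for (int pass = 0; pass < maxPasses; pass++)
        {
            string decoded = WebUtility.UrlDecode(input);
            if (decoded == input)
            {
                break; // nothing left to decode
            }
            input = decoded;
        }

        return input;
    }
}
```

With this sketch, `"a%2526b".UrlDecodeRepeatedly()` goes through `a%26b` and ends at `a&b`.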

Upvotes: 6

Views: 4524

Answers (3)

Dimitar Dimitrov

Reputation: 15148

In case this is helpful to anyone, here is a recursive version for multiple HTML encoded strings (I find it a bit easier to read):

public static string HtmlDecode(string input) {
    string decodedInput = WebUtility.HtmlDecode(input);

    if (input == decodedInput) {
        return input;
    }

    return HtmlDecode(decodedInput);
}

WebUtility is in the System.Net namespace.

Upvotes: 1

LakshmiNarayanan

Reputation: 1188

Your code seems to decode HTML strings correctly, using multiple passes.

However, if the input HTML is malformed, i.e. not encoded properly, the result may be unexpected: bad input might not be decoded properly no matter how many times it passes through this method.

A quick check with two strings, one completely encoded and the other only partially encoded, yielded the following results:

"&lt;b&gt;" will decode to "<b>"

"&lt;b&gt" will decode to "<b&gt"
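The check above can be reproduced with something like the following sketch. It assumes `WebUtility.HtmlDecode` from System.Net as the decoder (the question uses `HttpUtility.HtmlDecode`, which should behave the same way for these inputs), and the class and method names are hypothetical. The point is that an entity missing its closing semicolon is left untouched, so further passes cannot fix it.

```csharp
using System;
using System.Net;

public static class MalformedEntityCheck
{
    // Hypothetical helper: repeats HtmlDecode until the string stops changing.
    public static string DecodeUntilStable(string input)
    {
        string decoded = WebUtility.HtmlDecode(input);
        while (decoded != input)
        {
            input = decoded;
            decoded = WebUtility.HtmlDecode(input);
        }
        return input;
    }

    // Returns true when the two results quoted in the answer are observed.
    public static bool MatchesAnswer()
    {
        bool wellFormed = DecodeUntilStable("&lt;b&gt;") == "<b>";
        bool malformed = DecodeUntilStable("&lt;b&gt") == "<b&gt"; // "&gt" lacks ';'
        return wellFormed && malformed;
    }
}
```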

Upvotes: 1

Haney

Reputation: 34802

That's probably the best approach, honestly. The real solution would be to rework your code so that things are encoded exactly once everywhere, so that you only ever need to decode them once.
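One possible reading of "encode exactly once" is to normalize at the point where scraped text enters the system: decode it fully, then encode it a single time before storing. The sketch below assumes that interpretation. The class and method names are hypothetical, and it uses `WebUtility` from System.Net.

```csharp
using System;
using System.Net;

public static class StorageNormalizer
{
    // Hypothetical helper: strips every layer of HTML encoding.
    private static string DecodeUntilStable(string input)
    {
        string decoded = WebUtility.HtmlDecode(input);
        while (decoded != input)
        {
            input = decoded;
            decoded = WebUtility.HtmlDecode(input);
        }
        return input;
    }

    // Assumed ingestion step: whatever the page did, the stored value
    // ends up encoded exactly once.
    public static string NormalizeForStorage(string raw)
    {
        return WebUtility.HtmlEncode(DecodeUntilStable(raw));
    }
}
```

After this step every stored value needs only a single `HtmlDecode` on the way out. The trade-off is the same as in the question: text that legitimately contains a literal entity such as `&amp;lt;` would be collapsed too.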

Upvotes: 3
