Reputation: 755179

Finding a parsing API for partially UTF-8 encoded URLs

When parsing HTML for certain web pages (most notably, any Windows Live page), I encounter a lot of URLs in the following format:

http\x3a\x2f\x2fjs.wlxrs.com\x2fjt6xQREgnzkhGufPqwcJjg\x2fempty.htm

These appear to be partially UTF-8 escaped strings (\x2f = /, \x3a = :, etc.). Is there a .NET API that can be used to transform these strings into a System.Uri? It seems easy enough to parse, but I'm trying to avoid building a new wheel today.

Upvotes: 8

Answers (3)

Royi Namir

Reputation: 148644

Here is another solution (continuing from @Timwi's solution):

string output = Regex.Replace(input, @"\\x([0-9a-f][0-9a-f])",
            m => ((char)Convert.ToInt32(m.Groups[1].Value, 16)).ToString());

Upvotes: 0

Timwi

Reputation: 66594

What you posted is not valid URL encoding: it uses JavaScript-style \xNN escapes rather than %NN percent-encoding. As such, of course HttpUtility.UrlDecode() won't work. But irrespective of that, you can turn this back into normal text like this:

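using System.Globalization;
using System.Text.RegularExpressions;

// Replace each \xNN escape with the character whose code is the hex value NN.
string input = @"http\x3a\x2f\x2fjs.wlxrs.com\x2fjt6xQREgnzkhGufPqwcJjg\x2fempty.htm";
string output = Regex.Replace(input, @"\\x([0-9a-f][0-9a-f])",
    m => ((char) int.Parse(m.Groups[1].Value, NumberStyles.HexNumber)).ToString());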

But notice that this assumes that the encoding is Latin-1 rather than UTF-8. The input you provided is inconclusive in that respect. If you need UTF-8 to work, you need a slightly longer route; you'll have to convert the string to bytes and replace the escape sequences with the relevant bytes in the process (probably needs a while loop), and then use Encoding.UTF8.GetString() on the resulting byte array.
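The longer UTF-8 route described above could look roughly like the sketch below. HexEscapeDecoder and DecodeUtf8 are hypothetical names, not a framework API, and the sketch assumes every \xNN escape stands for exactly one raw byte of a UTF-8 sequence; anything that is not a well-formed escape is treated as literal text.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical helper (not part of .NET): decodes \xNN escapes as UTF-8 bytes.
static class HexEscapeDecoder
{
    public static string DecodeUtf8(string input)
    {
        var bytes = new List<byte>(input.Length);
        int i = 0;
        while (i < input.Length)
        {
            // A well-formed \xNN escape contributes exactly one raw byte.
            if (i + 3 < input.Length && input[i] == '\\' && input[i + 1] == 'x'
                && Uri.IsHexDigit(input[i + 2]) && Uri.IsHexDigit(input[i + 3]))
            {
                bytes.Add(byte.Parse(input.Substring(i + 2, 2),
                    NumberStyles.HexNumber, CultureInfo.InvariantCulture));
                i += 4;
            }
            else
            {
                // Literal text: encode it as UTF-8, keeping surrogate pairs together.
                int len = char.IsSurrogatePair(input, i) ? 2 : 1;
                bytes.AddRange(Encoding.UTF8.GetBytes(input.Substring(i, len)));
                i += len;
            }
        }
        // Interpret the collected bytes as one UTF-8 string.
        return Encoding.UTF8.GetString(bytes.ToArray());
    }
}
```

With the input from the question, new Uri(HexEscapeDecoder.DecodeUtf8(input)) should then give a System.Uri. Unlike the regex above, this accepts upper-case hex digits as well, and a multi-byte sequence such as \xc3\xa9 comes out as a single character instead of two Latin-1 characters.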

Upvotes: 3

leppie

Reputation: 117280

Did you try HttpUtility.UrlDecode?

Upvotes: -1
