Reputation: 755179
When parsing HTML for certain web pages (most notably, any Windows Live page) I encounter a lot of URLs in the following format.
http\x3a\x2f\x2fjs.wlxrs.com\x2fjt6xQREgnzkhGufPqwcJjg\x2fempty.htm
These appear to be partially escaped UTF-8 strings (\x2f = /, \x3a = :, etc.). Is there a .NET API that can be used to transform these strings into a System.Uri? It seems easy enough to parse, but I'm trying to avoid building a new wheel today.
Upvotes: 8
Views: 1457
Reputation: 148644
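Here is another solution (continuing from @timwi's solution). It uses Convert.ToInt32 with base 16 instead of int.Parse, and the pattern also accepts upper-case hex digits:
string output = Regex.Replace(input, @"\\x([0-9a-fA-F]{2})",
    m => ((char)Convert.ToInt32(m.Groups[1].Value, 16)).ToString());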
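Since the question asks for a System.Uri, the replacement can be wrapped as in the rough sketch below. The class and method names (EscapedUrl, TryParse) are made up for illustration, and it assumes every escape is a single Latin-1 character, as in the regex above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any .NET API.
public static class EscapedUrl
{
    // Replaces each \xNN escape with the character whose code is NN,
    // then tries to build an absolute Uri from the result.
    public static bool TryParse(string input, out Uri uri)
    {
        string decoded = Regex.Replace(input, @"\\x([0-9a-fA-F]{2})",
            m => ((char)Convert.ToInt32(m.Groups[1].Value, 16)).ToString());

        // TryCreate returns false instead of throwing on malformed input.
        return Uri.TryCreate(decoded, UriKind.Absolute, out uri);
    }
}
```

Using Uri.TryCreate rather than the Uri constructor avoids an exception when a scraped string turns out not to be a URL at all.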
Upvotes: 0
Reputation: 66594
What you posted is not URL-encoded (URL encoding would use %3a, not \x3a), so of course HttpUtility.UrlDecode() won't work. But irrespective of that, you can turn this back into normal text like this:
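// Needs: using System.Globalization; using System.Text.RegularExpressions;
string input = @"http\x3a\x2f\x2fjs.wlxrs.com\x2fjt6xQREgnzkhGufPqwcJjg\x2fempty.htm";
// Replace each \xNN escape with the character whose code is the hex value NN.
string output = Regex.Replace(input, @"\\x([0-9a-fA-F]{2})",
    m => ((char) int.Parse(m.Groups[1].Value, NumberStyles.HexNumber)).ToString());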
But notice that this assumes the encoding is Latin-1 rather than UTF-8, because each escaped byte is turned directly into one character. The input you provided is inconclusive in that respect, since it only contains ASCII escapes, which are the same in both encodings. If you need UTF-8 to work, you need a slightly longer route: convert the string to bytes, replacing the escape sequences with the relevant bytes in the process (this probably needs a while loop), and then call Encoding.UTF8.GetString() on the resulting byte array.
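A minimal sketch of that UTF-8 route might look like the following. The names HexEscapeDecoder and DecodeUtf8 are hypothetical, and the sketch assumes that every \xNN escape stands for one raw byte and that all other characters are literal text.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of any .NET API.
public static class HexEscapeDecoder
{
    public static string DecodeUtf8(string input)
    {
        var bytes = new List<byte>(input.Length);
        int i = 0;
        while (i < input.Length)
        {
            // A \xNN escape contributes exactly one raw byte.
            if (i + 3 < input.Length && input[i] == '\\' && input[i + 1] == 'x'
                && Uri.IsHexDigit(input[i + 2]) && Uri.IsHexDigit(input[i + 3]))
            {
                bytes.Add(Convert.ToByte(input.Substring(i + 2, 2), 16));
                i += 4;
            }
            else
            {
                // Literal text is stored as its own UTF-8 bytes;
                // surrogate pairs are kept together so they encode correctly.
                int len = char.IsSurrogatePair(input, i) ? 2 : 1;
                bytes.AddRange(Encoding.UTF8.GetBytes(input.Substring(i, len)));
                i += len;
            }
        }
        // Interpret the whole byte sequence as UTF-8.
        return Encoding.UTF8.GetString(bytes.ToArray());
    }
}
```

With this, DecodeUtf8(@"caf\xc3\xa9") gives "café", whereas the regex version above would give "cafÃ©". For the URL in the question both approaches give the same result, which can then be passed to new Uri(...).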
Upvotes: 3