Reputation: 776
I use a forum that has a policy against direct commercial links, so what I often do is mangle a link so that it remains readable but requires a manual copy/paste/edit to work. Instead of www.example.com I will write www•example•com. The SO post editor encodes that URI as you'd expect, replacing the • with %E2%80%A2 (so https://www%E2%80%A2example%E2%80%A2com), but when I click the link I'm taken to https://xn--wwwexamplecom-kt6gha. That is also the HREF that the forum sends back after posting.
The xn-- prefix seems to be constant, and so is the "gluing" of the first two domain components, but annoyingly the rest varies as a function of the domain name. The -kt6gha bit is domain-specific, and the TLD can be glued to the rest (as here) or come after that alphanumeric part.
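For reference, the forward conversion can be reproduced from a browser console with the standard URL constructor, whose host parser applies the IDNA/Punycode mapping. This is just a sketch: whether the bullet character is accepted at all depends on the implementation's IDNA tables, so the constructor may throw instead.

// Sketch: reproduce the forward (mangled -> xn--) conversion.
// The WHATWG URL parser applies IDNA/Punycode to the hostname,
// folding the bullets into a single xn-- label.
// (Assumes the parser accepts U+2022; some implementations may reject it.)
var mangled = new URL('https://www•example•com');
console.log(mangled.hostname); // "xn--wwwexamplecom-kt6gha" (as observed above)
console.log(mangled.href);     // "https://xn--wwwexamplecom-kt6gha/"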
I'm guessing this conversion is deterministic, but can it be reversed? Preferably in a userscript.js so I can undo my own smart move for myself? ;)
Upvotes: -3
Views: 103
Reputation: 776
So this turns out to be punycode, which is intended for the encoding of labels in the Internationalized Domain Names in Applications (IDNA) framework, such that these domain names may be represented in the [7-bit] ASCII character set allowed in the Domain Name System of the Internet.
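A Punycode label keeps the plain ASCII characters up front, then a hyphen delimiter, then an encoding of where the non-ASCII characters get inserted; so xn--wwwexamplecom-kt6gha is "wwwexamplecom" plus instructions to re-insert the two bullets. As an aside (an assumption, not the decoder adapted below): if a full Punycode library such as the standalone punycode.js is available, e.g. pulled in via a userscript @require, its toUnicode() already handles whole domain names:

// Sketch, assuming punycode.js is loaded.
// toUnicode() walks the dot-separated labels and decodes any with the xn-- prefix;
// decode() is the raw algorithm applied to a bare label (no xn--).
console.log(punycode.toUnicode('xn--wwwexamplecom-kt6gha')); // should print "www•example•com"
console.log(punycode.decode('wwwexamplecom-kt6gha'));        // should print "www•example•com"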
I extracted and adapted the decoder from https://stackoverflow.com/a/301287/1460868 such that it works on full URLs:
this.ToUnicode = function (domain) {
    // Strip and remember the protocol so it can be re-attached afterwards.
    var protocol = '';
    if (domain.startsWith('https://')) {
        protocol = 'https://';
        domain = domain.substring(8);
    } else if (domain.startsWith('http://')) {
        protocol = 'http://';
        domain = domain.substring(7);
    }
    // Separate the hostname from the path, if any.
    var ua = domain.split('/');
    domain = ua[0];
    var urlpath = ua.slice(1);
    // Decode each dot-separated label that carries the xn-- prefix.
    var domain_array = domain.split(".");
    var out = [];
    for (var i = 0; i < domain_array.length; ++i) {
        var s = domain_array[i];
        out.push(
            s.match(/^xn--/) ?
                punycode.decode(s.slice(4)) :
                s
        );
    }
    // Reassemble: protocol + decoded hostname + path (if any).
    var result = protocol + out.join(".") +
        (urlpath.length ? '/' + urlpath.join('/') : '');
    return result;
}
(That's the modified bit; apart from that I only stripped the encoding functions.)
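For example (a sketch reusing the URL from the question; note that the decoded hostname still contains the bullets, so it still needs the replacement step below):

var fixed = punycode.ToUnicode('https://xn--wwwexamplecom-kt6gha');
console.log(fixed); // should print "https://www•example•com"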
I can now call that in this snippet that does some unmangling of links done by silly upstream forum filters:
// also do the same replacements in the URLs
var links = document.getElementsByTagName('a');
for (var i = 0; i < links.length; i++) {
    // Decode the punycoded hostname first, if there is one.
    var link = /[\/\.]xn--/.test(links[i].href) ?
        punycode.ToUnicode(links[i].href) :
        links[i].href;
    // Then apply the usual unmangling substitutions.
    urlRegexs.forEach(function (value, index) {
        var newlink = link.replace(value, urlReplacements[index]);
        if (newlink !== link) {
            links[i].href = newlink;
        }
    });
}
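The urlRegexs / urlReplacements pair is defined elsewhere in the userscript; as a purely hypothetical sketch that only undoes the bullet-for-dot trick from the question, it could be as simple as:

// Hypothetical contents: map each mangling pattern back to its real form.
var urlRegexs = [/•/g];
var urlReplacements = ['.'];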
What I don't get, though, is why browsers don't do this themselves, if the encoding is part of a standard!
Upvotes: 0