Samuel Dauzon

Reputation: 11324

PHP: issue with idn_to_utf8(): certain domains are not converted

In a PHP project I use the idn_to_utf8 function to convert domain names from Punycode to Unicode strings.

But sometimes this function returns the Punycode and not the Unicode string.

Example :

echo idn_to_utf8('xn--fiq57vn0d561bf5ukfonh1o');
// Return : xn--fiq57vn0d561bf5ukfonh1o
// It should return : 中島第２駐輪場 (the digit is the full-width ２, U+FF12)
echo idn_to_utf8('xn--fiqu6mnndw87c3ucbt0a1ea684a');
// Return : 中味鋺自転車置場

There are libraries that convert this Punycode correctly (http://idnaconv.phlymail.de/index.php?encoded=xn--fiq57vn0d561bf5ukfonh1o&decode=%3C%3C+Decode&lang=de), but I would prefer a built-in PHP function to a library.

Do you have any idea what causes this problem?

Edit / Solution and Explanation: To summarize and explain the problem, this code shows it:

echo idn_to_ascii('吉津第２自転車置場');
?><br /><?php
echo idn_to_utf8(idn_to_ascii('吉津第２自転車置場'));
?> Should be : 吉津第２自転車置場 <br /><?php

This code displays the following :

xn--2-958a11kws1a96p50fgxenr6afga

吉津第2自転車置場 (Should be) : 吉津第２自転車置場

To be clearer: when we ask for the Punycode of 吉津第２自転車置場, PHP first converts the string to 吉津第2自転車置場 (the character "2" is different: the full-width ２, U+FF12, becomes the ASCII 2). So idn_to_ascii cannot preserve every Unicode character, because PHP maps certain Unicode characters to others before encoding (in this example PHP converts ２ to 2; sorry for how "two to two" sounds).
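
The mapping described above can be reproduced in isolation. The following is only a sketch: foldFullWidthAscii() is a hypothetical helper that folds the full-width ASCII variants (U+FF01 to U+FF5E) to plain ASCII, which is one small part of the compatibility normalization IDNA applies. If the intl extension is available, Normalizer::normalize($text, Normalizer::FORM_KC) is probably the more complete tool.

```php
<?php
// Sketch: fold full-width ASCII variants (U+FF01..U+FF5E) to plain ASCII.
// foldFullWidthAscii() is a hypothetical helper, not part of PHP or intl.
function foldFullWidthAscii(string $text): string
{
    $folded = preg_replace_callback(
        '/[\x{FF01}-\x{FF5E}]/u',
        function (array $m): string {
            // Every character in this range is a 3-byte UTF-8 sequence.
            $b = $m[0];
            $cp = ((ord($b[0]) & 0x0F) << 12)
                | ((ord($b[1]) & 0x3F) << 6)
                | (ord($b[2]) & 0x3F);
            // Full-width forms sit exactly 0xFEE0 above their ASCII twins.
            return chr($cp - 0xFEE0);
        },
        $text
    );

    // preg_replace_callback() returns null on invalid UTF-8; keep the input then.
    return $folded ?? $text;
}

$original = "吉津第\u{FF12}自転車置場"; // U+FF12 is the full-width digit two
$folded   = foldFullWidthAscii($original);

var_dump($original === $folded); // bool(false): the two strings differ
echo $folded, PHP_EOL;           // 吉津第2自転車置場 (ASCII 2)
```

Comparing the input with its folded form before calling idn_to_ascii() is one way to detect names that will not survive a round trip unchanged.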

Upvotes: 1

Views: 2047

Answers (2)

Matthew Slyman

Reputation: 356

Without PECL/intl or PECL/idn, I had trouble getting the built-in idn_to_utf8() to work!

This alternative, IdnaConv.net, works well for me. Taking the domain name as a whole:

include __DIR__ . '/IdnaConvert.php';
$IDNA = new \Mso\IdnaConvert\IdnaConvert();

$domain = 'xn--b1amarcd.xn--ehq889crwebw5c4qa.net'; // 'новини.三明治餐馆.net'
$parts = explode('.', $domain);
$utf8parts = [];
foreach ($parts as $part) {
    // Only labels carrying the ACE prefix "xn--" are Punycode-encoded.
    if (\substr($part, 0, 4) === 'xn--') {
        $utf8parts[] = $IDNA->decode($part);
    } else {
        $utf8parts[] = $part;
    }
}
$utf8domain = implode('.', $utf8parts);
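
For a single label, what such a library does at its core can be sketched with a plain RFC 3492 decoder. The helper names below (punycodeDecodeLabel() and friends) are made up for illustration. The sketch deliberately skips all IDNA validation and normalization, which is presumably why a pure decoder accepts labels that idn_to_utf8() rejects.

```php
<?php
// Sketch of an RFC 3492 Punycode decoder for one label, without the "xn--" prefix.
// All function names are hypothetical; no IDNA validation is performed.

function punycodeDigit(string $c): int
{
    $o = ord($c);
    if ($o >= 0x30 && $o <= 0x39) { return $o - 0x30 + 26; } // '0'..'9' => 26..35
    if ($o >= 0x41 && $o <= 0x5A) { return $o - 0x41; }      // 'A'..'Z' => 0..25
    if ($o >= 0x61 && $o <= 0x7A) { return $o - 0x61; }      // 'a'..'z' => 0..25
    throw new InvalidArgumentException("Invalid Punycode digit: $c");
}

function punycodeAdapt(int $delta, int $numPoints, bool $first): int
{
    $delta = $first ? intdiv($delta, 700) : intdiv($delta, 2); // damp = 700
    $delta += intdiv($delta, $numPoints);
    $k = 0;
    while ($delta > 455) {            // ((base - tmin) * tmax) / 2
        $delta = intdiv($delta, 35);  // base - tmin
        $k += 36;                     // base
    }
    return $k + intdiv(36 * $delta, $delta + 38); // skew = 38
}

function punycodeCodePointToUtf8(int $cp): string
{
    if ($cp < 0x80) { return chr($cp); }
    if ($cp < 0x800) { return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F)); }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12)) . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18)) . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F)) . chr(0x80 | ($cp & 0x3F));
}

function punycodeDecodeLabel(string $input): string
{
    $n = 128; $i = 0; $bias = 72;
    $output = [];

    // Everything before the last "-" is copied verbatim as ASCII code points.
    $b = (int) strrpos($input, '-');
    for ($j = 0; $j < $b; $j++) {
        $output[] = ord($input[$j]);
    }

    $in  = $b > 0 ? $b + 1 : 0;
    $len = strlen($input);
    while ($in < $len) {
        $oldI = $i;
        $w = 1;
        // Read one variable-length integer: the delta to the next insertion.
        for ($k = 36; ; $k += 36) {
            if ($in >= $len) {
                throw new InvalidArgumentException('Truncated Punycode input');
            }
            $digit = punycodeDigit($input[$in++]);
            $i += $digit * $w;
            $t = $k <= $bias ? 1 : ($k >= $bias + 26 ? 26 : $k - $bias);
            if ($digit < $t) { break; }
            $w *= 36 - $t;
        }
        $outLen = count($output) + 1;
        $bias = punycodeAdapt($i - $oldI, $outLen, $oldI === 0);
        $n += intdiv($i, $outLen);   // next code point to insert
        $i %= $outLen;               // position at which to insert it
        array_splice($output, $i, 0, [$n]);
        $i++;
    }

    return implode('', array_map('punycodeCodePointToUtf8', $output));
}

echo punycodeDecodeLabel('bcher-kva'), PHP_EOL; // bücher
echo punycodeDecodeLabel('fiq57vn0d561bf5ukfonh1o'), PHP_EOL;
```

Because the second label contains no "-" delimiter, none of its decoded characters can be ASCII, so whatever digit it shows is not the ASCII "2".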

Upvotes: 0

mpyw

Reputation: 5754

This works fine. I think the full-width characters [Ａ-Ｚ０-９] cannot be used.

echo idn_to_utf8('xn--2-kq6aw43af1e4y9boczagup'); // 中島第2駐輪場

In fact, Chrome automatically converts 中島第２駐輪場.com (full-width ２) into 中島第2駐輪場.com (ASCII 2) before accessing it.

UPDATED:
A normalization rule named NAMEPREP seems to apply here: https://www.nic.ad.jp/ja/dom/idn.html

UPDATED:
That seems to be invalid... Validation Result
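
To check validity from PHP itself, idn_to_utf8() accepts a fourth by-reference argument that reports error bits when the UTS #46 variant is used. The sketch below assumes the intl extension is loaded; inspectIdnLabel() is a hypothetical wrapper, and which IDNA_ERROR_* bits get set for a given label depends on the ICU version, so no expected output is shown.

```php
<?php
// Sketch: ask intl why a label is rejected. inspectIdnLabel() is hypothetical.
function inspectIdnLabel(string $label): array
{
    if (!function_exists('idn_to_utf8')) {
        // intl is not loaded; nothing to inspect.
        return ['available' => false, 'result' => null, 'errors' => null];
    }

    $info = [];
    $result = idn_to_utf8($label, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46, $info);

    return [
        'available' => true,
        'result'    => $result,                 // decoded string, or false on failure
        'errors'    => $info['errors'] ?? null, // bitmask of IDNA_ERROR_* constants
    ];
}

foreach (['xn--fiq57vn0d561bf5ukfonh1o', 'xn--2-kq6aw43af1e4y9boczagup'] as $label) {
    var_dump(inspectIdnLabel($label));
}
```

A non-zero 'errors' value can be tested against constants such as IDNA_ERROR_INVALID_ACE_LABEL or IDNA_ERROR_DISALLOWED to narrow down the cause.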

Upvotes: 1
