Ramon

Reputation: 434

PHP Japanese string comparison with Unicode

I've seen multiple topics about this problem, but none of them deal with it in PHP. I have to find a string in a database. The problem is that the string I have to look for is Japanese-encoded (fullwidth characters) and doesn't match the database entries even when they are equal.

Search string:

Ｆｒｅｅ！

String in database:

Free!

Edit: Both strings are encoded in UTF-8. You can clearly see the difference between them. Is there a way to recognize these two strings as equal?

If there is no programmatic way to solve the problem, does anyone know a character database which I can use to convert the string manually?

Upvotes: 0

Views: 1174

Answers (1)

Morris Miao

Reputation: 730

Try this function (or something similar) to convert the fullwidth letters (the "Japanese" ones) to halfwidth (the ordinary letters we see every day) first, then compare. Hope this helps. :)

function makeSemiWidth($str)
{
// Keys are fullwidth characters, values are their halfwidth (ASCII) equivalents.
$arr = array('０' => '0',
             '１' => '1',
             '２' => '2',
             '３' => '3',
             '４' => '4',
             '５' => '5',
             '６' => '6',
             '７' => '7',
             '８' => '8',
             '９' => '9',
             'Ａ' => 'A',
             'Ｂ' => 'B',
             'Ｃ' => 'C',
             'Ｄ' => 'D',
             'Ｅ' => 'E',
             'Ｆ' => 'F',
             'Ｇ' => 'G',
             'Ｈ' => 'H',
             'Ｉ' => 'I',
             'Ｊ' => 'J',
             'Ｋ' => 'K',
             'Ｌ' => 'L',
             'Ｍ' => 'M',
             'Ｎ' => 'N',
             'Ｏ' => 'O',
             'Ｐ' => 'P',
             'Ｑ' => 'Q',
             'Ｒ' => 'R',
             'Ｓ' => 'S',
             'Ｔ' => 'T',
             'Ｕ' => 'U',
             'Ｖ' => 'V',
             'Ｗ' => 'W',
             'Ｘ' => 'X',
             'Ｙ' => 'Y',
             'Ｚ' => 'Z',
             'ａ' => 'a',
             'ｂ' => 'b',
             'ｃ' => 'c',
             'ｄ' => 'd',
             'ｅ' => 'e',
             'ｆ' => 'f',
             'ｇ' => 'g',
             'ｈ' => 'h',
             'ｉ' => 'i',
             'ｊ' => 'j',
             'ｋ' => 'k',
             'ｌ' => 'l',
             'ｍ' => 'm',
             'ｎ' => 'n',
             'ｏ' => 'o',
             'ｐ' => 'p',
             'ｑ' => 'q',
             'ｒ' => 'r',
             'ｓ' => 's',
             'ｔ' => 't',
             'ｕ' => 'u',
             'ｖ' => 'v',
             'ｗ' => 'w',
             'ｘ' => 'x',
             'ｙ' => 'y',
             'ｚ' => 'z',
             // Brackets
             '（' => '(',
             '）' => ')',
             '〔' => '[',
             '〕' => ']',
             '【' => '[',
             '】' => ']',
             '〖' => '[',
             '〗' => ']',
             '｛' => '{',
             '｝' => '}',
             '《' => '<',
             '》' => '>',
             // Symbols and punctuation
             '％' => '%',
             '＋' => '+',
             '—' => '-',
             '－' => '-',
             '～' => '-',
             '：' => ':',
             '。' => '.',
             '、' => ',',
             '，' => ',',
             '；' => ';',
             '？' => '?',
             '！' => '!',
             '…' => '-',
             '‖' => '|',
             '｜' => '|',
             // Quotation marks
             '“' => '"',
             '”' => '"',
             '〃' => '"',
             '‘' => '\'',
             '’' => '\'',
             // U+3000 ideographic space
             '　' => ' ');
// strtr() with an array replaces the longest keys first and never rescans its own output.
return strtr($str, $arr);
}
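As a shorter variant of the table above: the fullwidth forms U+FF01 to U+FF5E sit at a fixed offset of 0xFEE0 from ASCII 0x21 to 0x7E, so the basic part of the map can be generated instead of typed. This is only a sketch. The names `buildFullwidthMap` and `fullwidthToHalfwidth` are made up for illustration, and it assumes PHP 7.2+ with the mbstring extension (for `mb_chr`). It does not cover the CJK brackets and quotation marks that the hand-written table handles.

```php
<?php
// Hypothetical helper (not from the answer): builds the fullwidth -> halfwidth
// map from code points instead of listing every character by hand.
function buildFullwidthMap(): array
{
    // U+3000 (ideographic space) is outside the FF01..FF5E block, so add it explicitly.
    $map = ["\u{3000}" => ' '];

    // U+FF01..U+FF5E mirror ASCII 0x21..0x7E at a fixed offset of 0xFEE0.
    for ($cp = 0xFF01; $cp <= 0xFF5E; $cp++) {
        $map[mb_chr($cp, 'UTF-8')] = chr($cp - 0xFEE0);
    }

    return $map;
}

// Hypothetical helper: normalize a string before comparing it.
function fullwidthToHalfwidth(string $str): string
{
    return strtr($str, buildFullwidthMap());
}

$search = 'Ｆｒｅｅ！'; // fullwidth, as typed by the user
$stored = 'Free!';      // halfwidth, as stored in the database

var_dump($search === $stored);                       // bool(false)
var_dump(fullwidthToHalfwidth($search) === $stored); // bool(true)
```

Normalizing both the search string and the stored value the same way before comparing keeps the comparison symmetric.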

Or you may want to convert in the opposite direction as well. This function converts from fullwidth (the "Japanese") to halfwidth (our English), and from halfwidth to fullwidth.

<?PHP
function makeSemiWidth($str, $args2 = 1) {
// Halfwidth <-> fullwidth conversion function.
// Set the 2nd parameter to 0 to convert halfwidth (English) to fullwidth (Japanese);
// set it to 1 to convert fullwidth to halfwidth.
$DBC = Array( //fullwidth
'０' , '１' , '２' , '３' , '４' ,
'５' , '６' , '７' , '８' , '９' ,
'Ａ' , 'Ｂ' , 'Ｃ' , 'Ｄ' , 'Ｅ' ,
'Ｆ' , 'Ｇ' , 'Ｈ' , 'Ｉ' , 'Ｊ' ,
'Ｋ' , 'Ｌ' , 'Ｍ' , 'Ｎ' , 'Ｏ' ,
'Ｐ' , 'Ｑ' , 'Ｒ' , 'Ｓ' , 'Ｔ' ,
'Ｕ' , 'Ｖ' , 'Ｗ' , 'Ｘ' , 'Ｙ' ,
'Ｚ' , 'ａ' , 'ｂ' , 'ｃ' , 'ｄ' ,
'ｅ' , 'ｆ' , 'ｇ' , 'ｈ' , 'ｉ' ,
'ｊ' , 'ｋ' , 'ｌ' , 'ｍ' , 'ｎ' ,
'ｏ' , 'ｐ' , 'ｑ' , 'ｒ' , 'ｓ' ,
'ｔ' , 'ｕ' , 'ｖ' , 'ｗ' , 'ｘ' ,
'ｙ' , 'ｚ' , '－' , '　' , '：' ,
'．' , '，' , '／' , '％' , '＃' ,
'！' , '＠' , '＆' , '（' , '）' ,
'＜' , '＞' , '＂' , '＇' , '？' ,
'［' , '］' , '｛' , '｝' , '＼' ,
'｜' , '＋' , '＝' , '＿' , '＾' ,
'￥' , '￣' , '｀'
);
$SBC = Array( //halfwidth
'0', '1', '2', '3', '4', 
'5', '6', '7', '8', '9',
'A', 'B', 'C', 'D', 'E', 
'F', 'G', 'H', 'I', 'J',
'K', 'L', 'M', 'N', 'O', 
'P', 'Q', 'R', 'S', 'T',
'U', 'V', 'W', 'X', 'Y', 
'Z', 'a', 'b', 'c', 'd',
'e', 'f', 'g', 'h', 'i', 
'j', 'k', 'l', 'm', 'n',
'o', 'p', 'q', 'r', 's', 
't', 'u', 'v', 'w', 'x',
'y', 'z', '-', ' ', ':',
'.', ',', '/', '%', '#',
'!', '@', '&', '(', ')',
'<', '>', '"', '\'','?',
'[', ']', '{', '}', '\\',
'|', '+', '=', '_', '^',
'$', '~', '`'
);
if ($args2 == 0) {
    return str_replace($SBC, $DBC, $str);  //halfwidth -> fullwidth
} elseif ($args2 == 1) {
    return str_replace($DBC, $SBC, $str);  //fullwidth -> halfwidth
}
return false;  //unknown direction
}
/*
$str = "alskdf";
echo $str;
echo "<br>";
echo makeSemiWidth($str,0);
echo makeSemiWidth($str,1);
*/
?>
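PHP's mbstring extension also ships `mb_convert_kana()`, which may be able to replace the hand-written tables for the common cases. The sketch below is a different technique from the lookup arrays above, not the answer's own method. The wrapper names `toHalfwidth` and `toFullwidth` are made up for illustration, and it assumes mbstring is loaded and the strings are UTF-8.

```php
<?php
// Hypothetical wrappers around mb_convert_kana(); requires the mbstring extension.

// Fullwidth -> halfwidth.
function toHalfwidth(string $str): string
{
    // 'a' = fullwidth alphanumerics to halfwidth, 's' = fullwidth space to halfwidth.
    return mb_convert_kana($str, 'as', 'UTF-8');
}

// Halfwidth -> fullwidth.
function toFullwidth(string $str): string
{
    // 'A' = halfwidth alphanumerics to fullwidth, 'S' = halfwidth space to fullwidth.
    return mb_convert_kana($str, 'AS', 'UTF-8');
}

var_dump(toHalfwidth('Ｆｒｅｅ！') === 'Free!');     // bool(true)
var_dump(toFullwidth('Free!') === 'Ｆｒｅｅ！');     // bool(true)
```

According to the PHP manual, the `a`/`A` options skip a few characters (the double quote, apostrophe, backslash and tilde), so a small lookup table may still be needed if those matter for the comparison.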

You can also do this with a regular expression:

$str = preg_replace_callback('/\xa3([\xa1-\xfe])/', function ($m) { return chr(ord($m[1]) - 0x80); }, $str);

\xa3[\xa1-\xfe] matches the fullwidth ASCII range of the GB2312 character set (the "Japanese" letters). We take each match and subtract 0x80 (128 in decimal) from the second byte, which gives the corresponding halfwidth character (our normal English). A callback is used because the /e modifier was removed in PHP 7.

However, the pattern works on GB2312/GBK bytes, so it will not work reliably on a UTF-8 string. We need to convert the string to GBK first (and back to UTF-8 afterwards). To do so, use the code below:

$str = iconv('utf-8', 'gbk//IGNORE', $str); 

//IGNORE drops characters that exist in UTF-8 but have no equivalent in GBK.

Putting the two steps together gives the result.
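The two steps above (convert to GBK, then shift the second byte) might be combined roughly as follows. This is a sketch under assumptions: the function name `gbkFullwidthToHalfwidth` is made up, and it assumes an iconv build that supports GBK and the `//IGNORE` suffix (glibc's does; other builds may not). It also differs slightly from the one-line pattern: it matches whole double-byte characters rather than a bare `\xa3`, so a match cannot start on the trail byte of a neighbouring character.

```php
<?php
// Hypothetical helper combining the iconv and regex steps described above.
function gbkFullwidthToHalfwidth(string $utf8): string
{
    // Step 1: UTF-8 -> GBK, dropping characters GBK cannot represent.
    $gbk = iconv('UTF-8', 'GBK//IGNORE', $utf8);
    if ($gbk === false) {
        return $utf8; // conversion failed; leave the input untouched
    }

    // Step 2: consume one double-byte character per match so the pattern
    // stays aligned on character boundaries.
    $gbk = preg_replace_callback(
        '/([\x81-\xfe])([\x40-\xfe])/',
        function (array $m): string {
            // Lead byte 0xA3 with trail byte 0xA1..0xFE is a fullwidth ASCII character.
            if ($m[1] === "\xa3" && ord($m[2]) >= 0xa1) {
                return chr(ord($m[2]) - 0x80);
            }
            return $m[0]; // any other double-byte character is kept as-is
        },
        $gbk
    );

    // Step 3: back to UTF-8 so the result can be compared with the database value.
    $out = iconv('GBK', 'UTF-8', $gbk);
    return $out === false ? $utf8 : $out;
}

var_dump(gbkFullwidthToHalfwidth('Ｆｒｅｅ！') === 'Free!');
```

Because `//IGNORE` silently drops anything GBK cannot encode, the round trip can lose characters; working directly on the UTF-8 string avoids that.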

Upvotes: 1
