Reputation: 2172
I have found various solutions to this problem. One solution is to install a MySQL plugin called mysqlcft. However, that will not work for me, since the web host for my current tiny project does not support MySQL plugins. Are there any alternative solutions?
Upvotes: 4
Views: 4704
Reputation: 2172
I tried to solve the problem with RandomSeed's suggestion, "FULLTEXT and Asian Languages with MySQL 5.0". The problem is that you cannot perform a two-character search unless you set "ft_min_word_len" to 2, and again, a $1/month web hosting service does not let you do that.
Alright, I spent a night working out a workaround. The concept is to convert every non-alphabetic character of a UTF-8 string into a unique code: each CJK character becomes a four-character hex token, which satisfies MySQL's default ft_min_word_len of 4.
Here is the magic function, borrowed from a CSDN forum post and modified slightly.
function UTF2UCS($str, $s = false) {
    $charset = 'UTF-8';
    $str = mb_strtolower($str, $charset);
    $out = "";
    $len = mb_strlen($str, $charset);
    $inWord = false; // true while inside a run of plain word characters
    for ($i = 0; $i < $len; $i++) {
        $v = mb_substr($str, $i, 1, $charset);
        if (preg_match('/\w/i', $v)) {
            // ASCII letters, digits and "_" pass through unchanged
            $out .= $v;
            $inWord = true;
        } else {
            if ($inWord) $out .= " ";  // terminate the preceding word
            if ($s) $out .= "+";       // "+" is the boolean-mode AND operator
            // Encode the character as the hex of its UCS-2 code point,
            // e.g. "測" (U+6E2C) becomes the 4-character token "6e2c"
            $out .= bin2hex(iconv("UTF-8", "UCS-2", $v)) . " ";
            $inWord = false;
        }
    }
    return $out;
}
The result of
echo UTF2UCS("測試haha")
will be "6e2c 8a66 haha"
Say you have a field called "song_name". You just need to run every song name through the UTF2UCS function and save the encoded strings in a separate FULLTEXT-indexed field, e.g. "song_name_ucs".
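For illustration, setting up and backfilling that column could look like this (a minimal sketch; the mysqli connection $db and the id primary key are my assumptions, not part of the original setup, and on MySQL 5.x the table must use the MyISAM engine for FULLTEXT to work):
// One-off schema change: add the encoded column and its FULLTEXT index
$db->query("ALTER TABLE song ADD song_name_ucs TEXT, ADD FULLTEXT (song_name_ucs)");
// Backfill: encode every existing song name
$rows = $db->query("SELECT id, song_name FROM song");
$stmt = $db->prepare("UPDATE song SET song_name_ucs = ? WHERE id = ?");
while ($row = $rows->fetch_assoc()) {
    $ucs = UTF2UCS($row['song_name']); // no "+" prefixes when indexing
    $stmt->bind_param("si", $ucs, $row['id']);
    $stmt->execute();
}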
The next time you need to search something, all you need to do is:
$temp_string = UTF2UCS('測試', true);
SELECT * FROM song WHERE MATCH (song_name_ucs) AGAINST ('$temp_string' IN BOOLEAN MODE)
Remember to pass true as UTF2UCS's second parameter when you need both '測' and '試' to appear in the results: the '+' prefixes it emits are only treated as the AND operator IN BOOLEAN MODE.
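Put together in PHP, the search could look like this (again a sketch, reusing the assumed mysqli connection $db; real_escape_string guards the interpolated search string):
$temp_string = UTF2UCS('測試', true); // "+6e2c +8a66 "
$sql = "SELECT * FROM song
        WHERE MATCH (song_name_ucs)
        AGAINST ('" . $db->real_escape_string($temp_string) . "' IN BOOLEAN MODE)";
$result = $db->query($sql);
while ($row = $result->fetch_assoc()) {
    echo $row['song_name'], "\n"; // the original, human-readable name
}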
This might not be the best solution, but it requires no plugins or changes to the server. Pure PHP code.
Upvotes: 4
Reputation: 29759
The main problem is that:
Ideographic languages such as Chinese and Japanese do not have word delimiters. Therefore, [MySQL's] FULLTEXT parser cannot determine where words begin and end in these and other such languages.
Workarounds have been suggested that manually introduce artificial word separators (see FULLTEXT and Asian Languages with MySQL 5.0). While I know nothing about Asian languages, I believe the problem of separating words is not trivial:
The problem of word separation for CJK is nothing new, so I'd advise you check to see if there is already some library, function, etc. that you can use/adapt for your application for this purpose.
I doubt that such preprocessing can be done efficiently in MySQL alone (this FAQ seems to concur), so you would probably need to handle it at the application level.
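For instance, the usual application-level approach is n-gram segmentation: split the text into overlapping character pairs joined by spaces, so the FULLTEXT parser sees word boundaries. A minimal sketch (the function name and this particular implementation are mine, not taken from any of the sources above):
function cjk_bigrams($str) {
    // Split the string into individual UTF-8 characters
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    $grams = array();
    for ($i = 0; $i < count($chars) - 1; $i++) {
        $grams[] = $chars[$i] . $chars[$i + 1]; // overlapping pairs
    }
    return implode(' ', $grams);
}
echo cjk_bigrams('測試考試'); // "測試 試考 考試"
Note, however, that each two-character token is still shorter than the default ft_min_word_len of 4, which is exactly the limitation the hex-encoding answer above works around.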
I would advise looking into a dedicated full-text indexing solution. It seems to be possible to configure Sphinx or Lucene for this purpose.
Upvotes: 2