bbnn
bbnn

Reputation: 3602

How to check if the word is Japanese or English using PHP

I want to have different process for English word and Japanese word in this function

function process_word($word) {
   if($word is english) {
     /////////
   }else if($word is japanese) {
      ////////
   }
}

thank you

Upvotes: 13

Views: 16122

Answers (6)

Alec
Alec

Reputation: 9078

You could try Google's Translation API that has a detection function: http://code.google.com/apis/language/translate/v2/using_rest.html#detect-language

Upvotes: 3

Alix Axel
Alix Axel

Reputation: 154553

A quick solution that doesn't need the mb_string extension:

if (strlen($str) != strlen(utf8_decode($str))) {
    // $str uses multi-byte chars (isn't English)
}

else {
    // $str is ASCII (probably English)
}

Or a modification of the solution provided by @Alexander Konstantinov:

function isKanji($str) {
    return preg_match('/[\x{4E00}-\x{9FBF}]/u', $str) > 0;
}

function isHiragana($str) {
    return preg_match('/[\x{3040}-\x{309F}]/u', $str) > 0;
}

function isKatakana($str) {
    return preg_match('/[\x{30A0}-\x{30FF}]/u', $str) > 0;
}

function isJapanese($str) {
    return isKanji($str) || isHiragana($str) || isKatakana($str);
}

Upvotes: 27

Alexander Konstantinov
Alexander Konstantinov

Reputation: 5486

This function checks whether a word contains at least one Japanese letter (I found unicode range for Japanese letters in Wikipedia).

function isJapanese($word) {
    return preg_match('/[\x{4E00}-\x{9FBF}\x{3040}-\x{309F}\x{30A0}-\x{30FF}]/u', $word);
}

Upvotes: 22

Benoit
Benoit

Reputation: 3598

Try with mb_detect_encoding function, if encoding is EUC-JP or UTF-8 / UTF-16 it can be japanese, otherwise english. The better is if you can ensure which encoding each language, as UTF encodings can be used for many languages

Upvotes: 1

user253984
user253984

Reputation:

You can try to convert the charset and check if it succeeds.

Take a look at iconv: http://www.php.net/manual/en/function.iconv.php

If you can convert a string to ISO-8859-1 it might be english, if you can convert to iso-2022-jp it is propably japanese (I might be wrong for the exact charsets, you should google for them).

Upvotes: 0

Messa
Messa

Reputation: 25191

English text usually consists only of ASCII characters (or better say, characters in ASCII range).

Upvotes: 0

Related Questions