Reputation: 11283
What code snippets exist for detecting the language of a chunk of UTF-8 text? I need to filter a large amount of spam that happens to be in Chinese and Arabic. There's a PECL extension for that, but I want to do this purely in PHP code. I guess I need to loop through the string with a Unicode-aware version of ord() and build some kind of range table for the different languages.
Upvotes: 3
Views: 3520
Reputation: 2697
Arabic characters are mainly in the Unicode range 0600–06FF. Unicode also has a few supplementary blocks for Arabic. For example, the range 0750–077F contains mainly Arabic characters that are used primarily in some African languages. The range 08A0–08FF covers some more letters for African languages, for European and Central Asian languages, Pakistani Quranic marks, etc. The two other Unicode ranges for Arabic, FB50–FDFF and FE70–FEFF, are probably less important if you already cover 0600–06FF.
Characters for Chinese (and Japanese and Korean) are registered in a different Unicode range (with several extensions). The most important one is 4E00–9FD5. Assuming you don't need to worry about Japanese, this should be sufficient for script detection, but if you want to check the extensions, check the Unicode Consortium's list of code charts.
So if you only need to filter Arabic and Chinese scripts and don't want to use the approach suggested by troelskn (i.e. lists of common words for each language you want to identify, which does not scale well to a large number of languages), checking the code ranges of the characters in your input should be sufficient. An earlier Stack Overflow question already covers how to detect Unicode ranges in PHP.
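As a rough sketch of that range check in pure PHP, something like the following might work. It relies on PCRE's /u modifier so that `\x{...}` escapes match code points rather than bytes. The helper name `scriptRatio` and the two constants are made up for this example, and the ranges are simply the ones listed above, so treat it as a starting point rather than a complete detector.

```php
<?php
// Ranges copied from the answer above (extensions are not covered).
const ARABIC_RANGES = '\x{0600}-\x{06FF}\x{0750}-\x{077F}\x{08A0}-\x{08FF}\x{FB50}-\x{FDFF}\x{FE70}-\x{FEFF}';
const CJK_RANGES = '\x{4E00}-\x{9FD5}';

// Hypothetical helper: fraction (0.0 to 1.0) of the non-whitespace
// characters in $text that fall inside the given character-class ranges.
function scriptRatio(string $text, string $charClass): float
{
    // /u makes PCRE treat both the pattern and the subject as UTF-8.
    $total = preg_match_all('/\S/u', $text);
    if (!$total) {
        // Empty input, or false because $text is not valid UTF-8.
        return 0.0;
    }
    $hits = preg_match_all('/[' . $charClass . ']/u', $text);
    return $hits / $total;
}
```

A post could then be flagged when, say, `scriptRatio($post, ARABIC_RANGES)` or `scriptRatio($post, CJK_RANGES)` exceeds a cutoff of your choosing. PCRE also understands script properties such as `\p{Arabic}` and `\p{Han}`, which may be easier to maintain than hand-written ranges.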
Upvotes: 0
Reputation: 625087
Pipe your text through Google's language detection API. You can do this via AJAX; see the documentation and developer's guide. For example:
<html>
  <head>
    <script type="text/javascript" src="http://www.google.com/jsapi"></script>
    <script type="text/javascript">
      google.load("language", "1");
      function initialize() {
        var text = document.getElementById("text").innerHTML;
        google.language.detect(text, function(result) {
          if (!result.error && result.language) {
            google.language.translate(text, result.language, "en",
              function(result) {
                var translated = document.getElementById("translation");
                if (result.translation) {
                  translated.innerHTML = result.translation;
                }
              });
          }
        });
      }
      google.setOnLoadCallback(initialize);
    </script>
  </head>
  <body>
    <div id="text">你好,很高興見到你。</div>
    <div id="translation"></div>
  </body>
</html>
Upvotes: 4
Reputation: 117497
The simplest approach is probably to have a dictionary of common words in different languages and then test how many positive matches you get against each language. It's a rather costly (computation-wise) task though.
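A minimal sketch of the idea, assuming tiny hand-picked word lists (real ones would be much longer) and the mbstring extension for lowercasing. The names `STOP_WORDS` and `guessLanguage` are invented for this example. Note that it only works for languages that separate words with spaces, so it would not help much with Chinese.

```php
<?php
// Hypothetical, deliberately tiny lists of very common words per language.
const STOP_WORDS = [
    'en' => ['the', 'and', 'is', 'of', 'to', 'in'],
    'de' => ['der', 'die', 'und', 'ist', 'das', 'nicht'],
    'fr' => ['le', 'la', 'et', 'est', 'les', 'des'],
];

// Returns the language code with the most common-word hits, or null if none match.
function guessLanguage(string $text): ?string
{
    // Lowercase, then split on every run of non-letter characters.
    $words = preg_split('/[^\p{L}]+/u', mb_strtolower($text, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
    if (!$words) {
        return null;
    }
    // Count each distinct word once up front so lookups below are cheap.
    $counts = array_count_values($words);

    $best = null;
    $bestScore = 0;
    foreach (STOP_WORDS as $lang => $list) {
        $score = 0;
        foreach ($list as $word) {
            $score += $counts[$word] ?? 0;
        }
        if ($score > $bestScore) {
            $best = $lang;
            $bestScore = $score;
        }
    }
    return $best;
}
```

Counting the words once with `array_count_values` keeps the cost proportional to the text length plus the total size of the word lists, which takes some of the sting out of the computation.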
Upvotes: 0
Reputation: 655239
You could translate the UTF-8 string into its Unicode code points and look for “suspicious ranges”.
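function utf8ToUnicode($utf8)
{
    if (!is_string($utf8)) {
        return false;
    }
    $unicode = array();  // decoded code points
    $mbbytes = array();  // bytes of the multibyte sequence being collected
    $mblength = 1;       // expected length of that sequence
    $strlen = strlen($utf8);
    for ($i = 0; $i < $strlen; $i++) {
        // Square brackets: the $utf8{$i} syntax was removed in PHP 8.
        $byte = ord($utf8[$i]);
        if ($byte < 128) {
            // Single-byte (ASCII) character
            $unicode[] = $byte;
        } else {
            if (count($mbbytes) == 0) {
                // Lead byte: 110xxxxx = 2 bytes, 1110xxxx = 3, 11110xxx = 4
                $mblength = ($byte < 224) ? 2 : (($byte < 240) ? 3 : 4);
            }
            $mbbytes[] = $byte;
            if (count($mbbytes) == $mblength) {
                if ($mblength == 4) {
                    $unicode[] = ($mbbytes[0] & 7) * 262144 + ($mbbytes[1] & 63) * 4096 + ($mbbytes[2] & 63) * 64 + ($mbbytes[3] & 63);
                } elseif ($mblength == 3) {
                    $unicode[] = ($mbbytes[0] & 15) * 4096 + ($mbbytes[1] & 63) * 64 + ($mbbytes[2] & 63);
                } else {
                    $unicode[] = ($mbbytes[0] & 31) * 64 + ($mbbytes[1] & 63);
                }
                $mbbytes = array();
                $mblength = 1;
            }
        }
    }
    return $unicode;
}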
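To look for the “suspicious ranges”, the resulting array of code points could be run against a small range table, roughly like this. The table uses the Arabic and CJK ranges mentioned in another answer here. The names `SCRIPT_RANGES`, `countScripts` and `looksLikeSpam`, as well as the 0.3 threshold, are assumptions made for this sketch and would need tuning against real data.

```php
<?php
// Inclusive [low, high] code point ranges per script.
const SCRIPT_RANGES = [
    'arabic' => [[0x0600, 0x06FF], [0x0750, 0x077F], [0x08A0, 0x08FF], [0xFB50, 0xFDFF], [0xFE70, 0xFEFF]],
    'cjk'    => [[0x4E00, 0x9FD5]],
];

// Hypothetical helper: counts how many code points fall into each script.
function countScripts(array $codePoints): array
{
    $counts = array_fill_keys(array_keys(SCRIPT_RANGES), 0);
    foreach ($codePoints as $cp) {
        foreach (SCRIPT_RANGES as $script => $ranges) {
            foreach ($ranges as [$low, $high]) {
                if ($cp >= $low && $cp <= $high) {
                    $counts[$script]++;
                    continue 3; // matched, move on to the next code point
                }
            }
        }
    }
    return $counts;
}

// Hypothetical rule: flag text when the share of Arabic plus CJK
// code points reaches $threshold (an arbitrary default).
function looksLikeSpam(array $codePoints, float $threshold = 0.3): bool
{
    $total = count($codePoints);
    if ($total === 0) {
        return false;
    }
    return array_sum(countScripts($codePoints)) / $total >= $threshold;
}
```

With the function above, the call would look like `looksLikeSpam(utf8ToUnicode($text))`.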
Upvotes: 2