Reputation: 5927
I need to implement a very crude language identification algorithm. In my world, there are only two languages: English and not-English. I have an ArrayList of Strings and I need to determine whether each String is likely in English or in the other language, whose Unicode characters fall in a certain range. So what I want to do is check each String against this range using some type of "presence" test. If it passes the test, I say the String is not English; otherwise it's English. I want to try two types of tests: TEST-ANY (the String passes if any of its characters is in the range) and TEST-ALL (the String passes only if all of its characters are in the range).
Since the array might be very long, I need to implement this very efficiently. What would be the fastest way of doing this in Java?
Thx
UPDATE: I am specifically checking for non-English by looking at a specific range of Unicode values rather than checking whether the characters are ASCII, in part to take care of the "resume" problem mentioned below. What I am trying to figure out is whether Java provides any classes/methods that essentially implement TEST-ANY or TEST-ALL (or another similar test) as efficiently as possible. In other words, I am trying to avoid reinventing the wheel, especially if the wheel invented before me is better anyway.
Upvotes: 2
Views: 4248
Reputation:
I really don't think that this solution is ideal for determining language, but if you want to check whether a string is all ASCII, you could do something like this:
public static boolean isASCII(String s) {
    boolean ret = true;
    for (int i = 0; i < s.length(); i++) {
        if (s.charAt(i) >= 128) {
            ret = false;
            break;
        }
    }
    return ret;
}
So then if you try this:
boolean r = isASCII("Hello");
r would equal true. But if you try:
boolean r = isASCII("Grüß dich");
then r would equal false. I haven't tested performance, but this would work reasonably fast, because all it does is compare a character to the number 128.
But as @AlexanderPogrebnyak mentioned in the comments above, this will return false if you give it "résumé". Be aware of that.
"I am specifically checking for non-English by looking at a specific range of Unicodes rather than checking for whether the characters are ASCII"
But ASCII is itself a range in Unicode: its 128 characters are the first 128 Unicode code points (and UTF-8 encodes them as the same single bytes). Unicode is an extension of ASCII. What the code @mP. and I provided does is check whether each character is in a certain range. I chose that range to be ASCII, which is any Unicode character with a decimal value of less than 128. You can just as well choose any other range. But the reason I chose ASCII is that it's the one with the Latin alphabet, the Arabic numerals, and some other common characters that would normally be in an 'English' string.
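If the "other" language maps onto a named Unicode block, the range doesn't have to be hard-coded as numbers: Java's Character.UnicodeBlock.of(int) reports which block a code point belongs to. Here is a rough sketch of that idea. The class and method names (BlockCheck, containsBlock) are made up for illustration, and whether a single block covers your language is an assumption you'd need to check.

```java
class BlockCheck {
    // Returns true if at least one code point of s belongs to the given Unicode block.
    static boolean containsBlock(String s, Character.UnicodeBlock block) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            // UnicodeBlock.of returns null for code points outside any defined block,
            // so call equals on the (non-null) block argument.
            if (block.equals(Character.UnicodeBlock.of(cp))) {
                return true;
            }
            // Step over both chars of a surrogate pair when there is one.
            i += Character.charCount(cp);
        }
        return false;
    }
}
```

For example, containsBlock(word, Character.UnicodeBlock.CYRILLIC) would flag a word containing any Cyrillic letter, while plain "hello" would not be flagged.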
Upvotes: 4
Reputation: 5927
Here's how I ended up implementing TEST-ANY:
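// TEST-ANY: the word is not English if any code point falls in [UrangeLow, UrangeHigh]
String str = "wordToTest";
int UrangeLow = 1234; // can get range from e.g. http://www.utf8-chartable.de/unicode-utf8-table.pl
int UrangeHigh = 2345;
for (int iLetter = 0; iLetter < str.length(); ) {
    int cp = str.codePointAt(iLetter);
    if (cp >= UrangeLow && cp <= UrangeHigh) {
        // word is NOT English
        return;
    }
    // advance by one or two chars so a surrogate pair is read as a single code point
    iLetter += Character.charCount(cp);
}
// word is English
return;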
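On the "does Java already provide this" part: from Java 8 on, String.codePoints() returns an IntStream, and IntStream has anyMatch and allMatch, which line up with TEST-ANY and TEST-ALL. A minimal sketch follows. The names RangeTests, testAny and testAll are hypothetical, the Cyrillic range is only an example, and I haven't benchmarked this against the plain loop above, so treat the performance as unknown.

```java
class RangeTests {
    // TEST-ANY: true if at least one code point lies in [low, high].
    static boolean testAny(String s, int low, int high) {
        return s.codePoints().anyMatch(cp -> cp >= low && cp <= high);
    }

    // TEST-ALL: true if every code point lies in [low, high].
    // Note that allMatch returns true for an empty string.
    static boolean testAll(String s, int low, int high) {
        return s.codePoints().allMatch(cp -> cp >= low && cp <= high);
    }

    public static void main(String[] args) {
        // Example range only: the Cyrillic block, U+0400 to U+04FF.
        int low = 0x0400;
        int high = 0x04FF;
        String mixed = "hello \u043c\u0438\u0440"; // "hello" followed by a Cyrillic word
        System.out.println(testAny(mixed, low, high)); // true
        System.out.println(testAll(mixed, low, high)); // false
    }
}
```

Both methods short-circuit, so they stop at the first code point that decides the result, just like the early return in the loop.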
Upvotes: 4
Reputation: 18266
public static boolean isAscii( String s ){
    final int length = s.length();
    for( int i = 0; i < length; i++){
        final char c = s.charAt( i );
        // note: this also rejects '{', '|', '}', '~' and DEL, which are ASCII but sort after 'z'
        if( c > 'z' ){
            return false;
        }
    }
    return true;
}
@Hassan, thanks for spotting the typo; I replaced the test against big Z with little z.
Upvotes: 2