
Reputation: 5927

Java: looking for the fastest way to check String for presence of Unicode chars in certain range

I need to implement a very crude language identification algorithm. In my world, there are only two languages: English and not-English. I have an ArrayList of Strings and I need to determine whether each String is likely in English or in the other language, whose Unicode chars fall in a certain range. So what I want to do is check each String against this range using some type of "presence" test. If it passes the test, I say the String is not English; otherwise it's English. I want to try two types of tests:

TEST-ANY: If any char in the string falls within the range, the string passes the test
TEST-ALL: If all chars in the string fall within the range, the string passes the test

Since the array might be very long, I need to implement this very efficiently. What would be the fastest way of doing this in Java?

Thx

UPDATE: I am specifically checking for non-English by looking at a specific range of Unicode code points rather than checking whether the characters are ASCII, in part to take care of the "résumé" problem mentioned below. What I am trying to figure out is whether Java provides any classes/methods that essentially implement TEST-ANY or TEST-ALL (or another similar test) as efficiently as possible. In other words, I am trying to avoid reinventing the wheel, especially if the wheel invented before me is better anyway.

Upvotes: 2

Answers (3)

user377628

Reputation:

I really don't think that this solution is ideal for determining language, but if you want to check whether a string is all ASCII, you could do something like this:

public static boolean isASCII(String s){
    boolean ret = true;
    for(int i = 0; i < s.length() ; i++) {
        if(s.charAt(i)>=128){
            ret = false;
            break;
        }
    }
    return ret;
}

So then if you try this:

boolean r = isASCII("Hello");

r would equal true. But if you try:

boolean r = isASCII("Grüß dich");

then r would equal false. I haven't tested performance, but this would work reasonably fast, because all it does is compare a character to the number 128.

But as @AlexanderPogrebnyak mentioned in the comments above, this will return false if you give it "résumé". Be aware of that.

Update:

I am specifically checking for non-English by looking at a specific range of Unicodes rather than checking for whether the characters are ASCII

But ASCII is itself a range in Unicode: the first 128 code points (U+0000 to U+007F), whatever the encoding. Unicode is a superset of ASCII. What the code @mP. and I provided does is check whether each character is in a certain range. I chose that range to be ASCII, which is any Unicode character with a decimal value below 128. You can just as well choose any other range. I chose ASCII because it contains the Latin alphabet, the Arabic numerals, and some other common characters that would normally appear in an 'English' string.
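As a rough sketch of choosing a different range: Java's Character.UnicodeBlock maps a code point to its named Unicode block, so both tests can be phrased against a block instead of hard-coded numbers. The class name BlockCheck, the method names, and the choice of Cyrillic as the "other" language are assumptions made purely for illustration, and String.codePoints() needs Java 8 or later.

```java
import java.lang.Character.UnicodeBlock;

class BlockCheck {
    // TEST-ANY: true if at least one code point of s belongs to the given block.
    static boolean anyInBlock(String s, UnicodeBlock block) {
        return s.codePoints().anyMatch(cp -> UnicodeBlock.of(cp) == block);
    }

    // TEST-ALL: true if s is non-empty and every code point belongs to the block.
    // (Rejecting the empty string is an assumption; allMatch alone would accept it.)
    static boolean allInBlock(String s, UnicodeBlock block) {
        return !s.isEmpty() && s.codePoints().allMatch(cp -> UnicodeBlock.of(cp) == block);
    }

    public static void main(String[] args) {
        UnicodeBlock other = UnicodeBlock.CYRILLIC; // assumed "not-English" block
        String resume = "r\u00e9sum\u00e9";                 // accented Latin letters
        String mixed = "hello \u043c\u0438\u0440";          // English plus a Cyrillic word
        String cyrillic = "\u043f\u0440\u0438\u0432\u0435\u0442"; // Cyrillic only

        System.out.println(anyInBlock(resume, other));   // false
        System.out.println(anyInBlock(mixed, other));    // true
        System.out.println(allInBlock(mixed, other));    // false
        System.out.println(allInBlock(cyrillic, other)); // true
    }
}
```

Both stream operations short-circuit, so TEST-ANY stops at the first match and TEST-ALL at the first miss. Whether this beats a plain loop is something I have not measured.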

Upvotes: 4

I Z

Reputation: 5927

Here's how I ended up implementing TEST-ANY:

// TEST-ANY
String str = "wordToTest";
int UrangeLow = 1234; // can get range from e.g. http://www.utf8-chartable.de/unicode-utf8-table.pl
int UrangeHigh = 2345;
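for(int iLetter = 0; iLetter < str.length() ; ) {
   int cp = str.codePointAt(iLetter);
   if (cp >= UrangeLow && cp <= UrangeHigh) {
      // word is NOT English
      return;
   }
   // step by 1 or 2 chars so a surrogate pair is read as a single code point
   iLetter += Character.charCount(cp);
}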
// word is English
return;
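
TEST-ALL can follow the same pattern with the condition inverted: bail out on the first code point that falls outside the range. The following is only a sketch, not something I have benchmarked; the class name RangeTests, the method names, the filterAll helper, and the decision to treat an empty string as failing the test are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class RangeTests {
    // TEST-ALL: true only if s is non-empty and every code point lies in [low, high].
    static boolean testAll(String s, int low, int high) {
        if (s.isEmpty()) {
            return false; // assumption: an empty string is not classified as not-English
        }
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < low || cp > high) {
                return false; // found a code point outside the range
            }
            i += Character.charCount(cp); // 2 for supplementary characters, otherwise 1
        }
        return true;
    }

    // Hypothetical helper: collect the strings from a list that pass TEST-ALL.
    static List<String> filterAll(List<String> words, int low, int high) {
        List<String> result = new ArrayList<>();
        for (String w : words) {
            if (testAll(w, low, high)) {
                result.add(w);
            }
        }
        return result;
    }
}
```

For example, RangeTests.testAll(word, 0x0400, 0x04FF) would check a word against the Cyrillic block, and filterAll applies the same check across the whole ArrayList in one pass.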

Upvotes: 4

mP.

Reputation: 18266

public static boolean isAscii( String s ){
    int length = s.length();
    for( int i = 0; i < length; i++){
       final char c = s.charAt( i );
       if( c > 'z' ){
          return false;
       }
    }
    return true;
}

@Hassan, thanks for spotting the typo; I replaced the test against uppercase 'Z' with lowercase 'z'.

Upvotes: 2
