Reputation: 10662
I found a few references to using a regex to filter out non-English text, but none of them is in Java, and they all address somewhat different problems from the one I am trying to solve:
1. Replace all non-English characters with a space.
2. Create a method that returns true if a string contains any non-English character.
By "English text" I mean not only actual letters and numbers but also punctuation.
So far, what I have been able to come up with for goal #1 is quite simple:
String.replaceAll("\\W", " ")
In fact, so simple that I suspect that I am missing something... Do you spot any caveats in the above?
As for goal #2, I could simply trim() the string after the above replaceAll(), then check if it's empty. But... Is there a more efficient way to do this?
Upvotes: 2
Views: 16086
Reputation: 3868
This works for me
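import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

private static boolean isEnglish(String text) {
    // True if every character can be encoded in US-ASCII or ISO-8859-1 (Latin-1).
    CharsetEncoder asciiEncoder = Charset.forName("US-ASCII").newEncoder();
    CharsetEncoder isoEncoder = Charset.forName("ISO-8859-1").newEncoder();
    return asciiEncoder.canEncode(text) || isoEncoder.canEncode(text);
}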
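One caveat worth checking before relying on this: ISO-8859-1 is a superset of US-ASCII, so the second encoder accepts everything the first one does, plus accented Latin-1 letters. A minimal sketch that separates the two checks (the class and method names here are my own, purely for illustration):

```java
import java.nio.charset.StandardCharsets;

final class EncoderCheckDemo {
    // True only if every character is in the 7-bit US-ASCII range.
    static boolean isAscii(String text) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    // True if every character is in ISO-8859-1 (Latin-1), which includes all of US-ASCII.
    static boolean isLatin1(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }
}
```

With this split, "caf\u00e9" fails isAscii but passes isLatin1, so whether the combined check counts as "English" depends on whether accented letters are acceptable to you.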
Upvotes: 3
Reputation: 230
Here is my solution. I assume the text may contain English words, punctuation marks and standard ASCII symbols such as #, %, @, etc.
private static final String IS_ENGLISH_REGEX = "^[ \\w \\d \\s \\. \\& \\+ \\- \\, \\! \\@ \\# \\$ \\% \\^ \\* \\( \\) \\; \\\\ \\/ \\| \\< \\> \\\" \\' \\? \\= \\: \\[ \\] ]*$";
private static boolean isEnglish(String text) {
    if (text == null) {
        return false;
    }
    return text.matches(IS_ENGLISH_REGEX);
}
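A hand-written list like the one above is easy to leave incomplete (it has no entry for `{`, `}` or `~`, for example). If "English" can be taken to mean printable ASCII plus whitespace, which is an assumption on my part, a shorter sketch using Java's POSIX class \p{Print} might look like this (the class and method names are hypothetical):

```java
final class PrintableAsciiCheck {
    // \p{Print} covers printable US-ASCII (0x20 to 0x7E); \s adds tabs and line breaks.
    private static final String PRINTABLE_ASCII_REGEX = "[\\p{Print}\\s]*";

    static boolean isPrintableAscii(String text) {
        if (text == null) {
            return false;
        }
        // String.matches() requires the whole string to match the pattern.
        return text.matches(PRINTABLE_ASCII_REGEX);
    }
}
```

Note that the empty string passes this check, since the pattern uses `*`.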
Upvotes: 3
Reputation: 359816
In fact, so simple that I suspect that I am missing something... Do you spot any caveats in the above?
\W is equivalent to [^\w], and \w is equivalent to [a-zA-Z_0-9]. Using \W will replace everything which isn't a letter, a number, or an underscore, such as tabs and newline characters. Whether or not that's a problem is really up to you.
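To make that concrete, here is a small sketch of what the replaceAll from the question does to a few inputs (the class and method names are mine, just for the demo). By default Java's \w is ASCII-only, so accented letters are replaced too:

```java
final class NonWordDemo {
    // Replaces every character outside [a-zA-Z_0-9] with a single space.
    static String replaceNonWord(String input) {
        return input.replaceAll("\\W", " ");
    }

    public static void main(String[] args) {
        // The comma, the space and the exclamation mark each become a space.
        System.out.println("[" + replaceNonWord("Hi, there!") + "]"); // [Hi  there ]
        // The tab becomes a space; the underscore is kept because it is part of \w.
        System.out.println("[" + replaceNonWord("a\tb_c") + "]");     // [a b_c]
        // The accented letter is outside ASCII \w, so it is replaced as well.
        System.out.println("[" + replaceNonWord("caf\u00e9") + "]");  // [caf ]
    }
}
```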
By "English text" I mean not only actual letters and numbers but also punctuation.
In that case, you might want to use a negated character class that also leaves punctuation alone; something like
[^\w.,;:'"]
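As a rough sketch of how that class could be plugged into the replacement step (the names below are hypothetical, and the punctuation list is only the short one shown above, so characters such as `!` or `?` would still be replaced unless you add them):

```java
import java.util.regex.Pattern;

final class KeepPunctuationDemo {
    // Matches anything that is neither a word character nor one of . , ; : ' "
    private static final Pattern NOT_WORD_OR_PUNCT = Pattern.compile("[^\\w.,;:'\"]");

    // Replaces each matched character with a single space.
    static String replaceOthers(String input) {
        return NOT_WORD_OR_PUNCT.matcher(input).replaceAll(" ");
    }
}
```

For example, replaceOthers("Hi, there!") keeps the comma but turns the exclamation mark into a space.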
Create a method that returns true if a string contains any non-English character.
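import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Compiled once and reused; \W matches any character outside [a-zA-Z_0-9].
Pattern p = Pattern.compile("\\W");

boolean containsSpecialChars(String string)
{
    Matcher m = p.matcher(string);
    // find() stops at the first non-word character, so the whole string need not be scanned.
    return m.find();
}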
Upvotes: 5
Reputation: 274612
Assuming an English word is made up of characters from [a-zA-Z_0-9], you can return true if a string contains any non-English character by negating string.matches:
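// matches() must match the whole string, so ^ and $ are optional; * rather than + keeps "" from counting as non-English.
return !string.matches("\\w*");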
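The choice of quantifier matters for the empty string. A small sketch of the two variants side by side (class and method names are made up for the comparison):

```java
final class WordOnlyCheckDemo {
    // With +, at least one word character is required, so "" is reported as non-English.
    static boolean hasNonWordStrict(String s) {
        return !s.matches("\\w+");
    }

    // With *, zero characters are allowed, so "" is reported as clean.
    static boolean hasNonWordLenient(String s) {
        return !s.matches("\\w*");
    }
}
```

Both variants agree on non-empty input; they differ only in how they treat "".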
Upvotes: 0