Reputation: 111
I think "Latin characters" is what I mean in my question, but I'm not entirely sure what the correct classification is. I'm trying to use a regex Pattern to test whether a string contains non-Latin characters. I'm expecting the following results:
"abcDE 123"; // Yes, this should match
"!@#$%^&*"; // Yes, this should match
"aaàààäää"; // Yes, this should match
"ベビードラ"; // No, this shouldn't match
"😀😃😄😆"; // No, this shouldn't match
My understanding is that the built-in \p{IsLatin} preset simply detects whether any of the characters are Latin. I want to detect whether any characters are not Latin.
Pattern LatinPattern = Pattern.compile("\\p{IsLatin}");
Matcher matcher = LatinPattern.matcher(str);
if (!matcher.find()) {
    System.out.println("is NON latin");
    return;
}
System.out.println("is latin");
Upvotes: 3
Views: 3240
Reputation: 626748
The core Latin Unicode blocks are listed below (further Latin blocks, such as Latin Extended Additional, U+1E00–U+1EFF, also exist):
\p{InBasic_Latin}: U+0000–U+007F
\p{InLatin-1_Supplement}: U+0080–U+00FF
\p{InLatin_Extended-A}: U+0100–U+017F
\p{InLatin_Extended-B}: U+0180–U+024F
So, the answer is either of the following:
Pattern LatinPattern = Pattern.compile("^[\\p{InBasicLatin}\\p{InLatin-1Supplement}\\p{InLatinExtended-A}\\p{InLatinExtended-B}]+$");
Pattern LatinPattern = Pattern.compile("^[\\x00-\\x{024F}]+$"); //U+0000-U+024F
Note that the Java patterns use the block names without underscores (for example, \p{InBasicLatin} rather than \p{InBasic_Latin}).
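To see which block a given character falls into, a small check with Character.UnicodeBlock.of can help. This is only an illustrative sketch: the class name BlockLookup and the helper blockOf are made up for the example, and the sample characters are written as Unicode escapes so the snippet does not depend on the source-file encoding.

```java
import java.lang.Character.UnicodeBlock;

class BlockLookup {
    // Hypothetical helper: returns the Unicode block of the first code point in s.
    static UnicodeBlock blockOf(String s) {
        return UnicodeBlock.of(s.codePointAt(0));
    }

    public static void main(String[] args) {
        String[] samples = {
            "a",             // U+0061, Basic Latin
            "\u00E0",        // U+00E0, a with grave
            "\u30D9",        // U+30D9, a katakana letter
            "\uD83D\uDE00"   // U+1F600, an emoji (surrogate pair)
        };
        for (String s : samples) {
            System.out.println(s + " => " + blockOf(s));
        }
    }
}
```

Only the first two samples land in one of the four blocks listed above, which is why the block-based pattern rejects the katakana and emoji strings.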
See the Java demo:
List<String> strs = Arrays.asList(
        "abcDE 123",  // Yes, this should match
        "!@#$%^&*",   // Yes, this should match
        "aaàààäää",   // Yes, this should match
        "ベビードラ",   // No, this shouldn't match
        "😀😃😄😆");    // No, this shouldn't match
Pattern LatinPattern = Pattern.compile("^[\\p{InBasicLatin}\\p{InLatin-1Supplement}\\p{InLatinExtended-A}\\p{InLatinExtended-B}]+$");
//Pattern LatinPattern = Pattern.compile("^[\\x00-\\x{024F}]+$"); //U+0000-U+024F
for (String str : strs) {
    Matcher matcher = LatinPattern.matcher(str);
    if (!matcher.find()) {
        System.out.println(str + " => is NON Latin");
        //return;
    } else {
        System.out.println(str + " => is Latin");
    }
}
Note: if you replace .find() with .matches(), you can throw away ^ and $ in the pattern.
Output:
abcDE 123 => is Latin
!@#$%^&* => is Latin
aaàààäää => is Latin
ベビードラ => is NON Latin
😀😃😄😆 => is NON Latin
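The note about .matches() can be sketched as follows. This is a minimal sketch, not part of the demo above: the class name LatinMatches and the helper isLatin are invented for illustration, and it uses the code point range variant of the pattern with the ^ and $ anchors dropped.

```java
import java.util.regex.Pattern;

class LatinMatches {
    // Same range as above (U+0000-U+024F), but without ^ and $.
    private static final Pattern LATIN = Pattern.compile("[\\x00-\\x{024F}]+");

    // matches() succeeds only if the entire input fits the pattern,
    // so explicit anchors are not needed.
    static boolean isLatin(String s) {
        return LATIN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isLatin("abcDE 123"));        // ASCII only
        System.out.println(isLatin("aa\u00E0\u00E4"));   // accented Latin letters
        System.out.println(isLatin("\u30D9\u30D3"));     // katakana
        System.out.println(isLatin("\uD83D\uDE00"));     // emoji
    }
}
```

Because of the + quantifier, an empty string is reported as not Latin here; switching to * would accept it.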
Upvotes: 2
Reputation: 159086
TL;DR: Use regex ^[\p{Print}\p{IsLatin}]*$
You want a regex that matches if the string consists only of Latin letters and printable ASCII characters (digits, punctuation, and spaces).
The easiest way is to combine \p{IsLatin} with \p{Print}, where Pattern defines \p{Print} as:
\p{Print} - A printable character: [\p{Graph}\x20]
\p{Graph} - A visible character: [\p{Alnum}\p{Punct}]
\p{Alnum} - An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Alpha} - An alphabetic character: [\p{Lower}\p{Upper}]
\p{Lower} - A lower-case alphabetic character: [a-z]
\p{Upper} - An upper-case alphabetic character: [A-Z]
\p{Digit} - A decimal digit: [0-9]
\p{Punct} - Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\x20 - A space
Which makes \p{Print} the same as [\p{ASCII}&&\P{Cntrl}], i.e. ASCII characters that are not control characters.
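That equivalence can be checked mechanically. The following is a rough sketch under the assumption that the default (non-UNICODE_CHARACTER_CLASS) POSIX classes are in effect; the class name PrintEquivalence and the method countMismatches are hypothetical names used only here.

```java
import java.util.regex.Pattern;

class PrintEquivalence {
    private static final Pattern PRINT = Pattern.compile("\\p{Print}");
    private static final Pattern ASCII_NON_CONTROL =
            Pattern.compile("[\\p{ASCII}&&\\P{Cntrl}]");

    // Counts the code points in [0, limit) on which the two patterns disagree.
    static int countMismatches(int limit) {
        int mismatches = 0;
        for (int cp = 0; cp < limit; cp++) {
            String s = new String(Character.toChars(cp));
            boolean isPrint = PRINT.matcher(s).matches();
            boolean isAsciiNonControl = ASCII_NON_CONTROL.matcher(s).matches();
            if (isPrint != isAsciiNonControl) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // 0x250 covers ASCII plus the Latin blocks up to Latin Extended-B.
        System.out.println("mismatches: " + countMismatches(0x250));
    }
}
```

The loop deliberately runs past U+007F to confirm that neither class matches anything outside ASCII in the default mode.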
The \p{Alpha} part overlaps with \p{IsLatin}, but that's fine, since the character class eliminates duplicates.
So, regex is: ^[\p{Print}\p{IsLatin}]*$
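If the check is phrased the other way round ("does the string contain any character that is not Latin?"), the same two classes can be negated and used with find(). This is an alternative sketch rather than the regex recommended above; the class name NonLatinFinder and the method containsNonLatin are made up for illustration, and the non-ASCII samples are written as Unicode escapes.

```java
import java.util.regex.Pattern;

class NonLatinFinder {
    // A single character that is neither printable ASCII nor in the Latin script.
    private static final Pattern NON_LATIN =
            Pattern.compile("[^\\p{Print}\\p{IsLatin}]");

    // find() stops at the first offending character, if there is one.
    static boolean containsNonLatin(String s) {
        return NON_LATIN.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsNonLatin("abcDE 123"));      // ASCII only
        System.out.println(containsNonLatin("aa\u00E0\u00E4")); // accented Latin letters
        System.out.println(containsNonLatin("abc\u30D9"));      // ASCII plus one katakana letter
        System.out.println(containsNonLatin("\uD83D\uDE00"));   // emoji
    }
}
```

Note that control characters such as tabs and newlines are not in \p{Print}, so both this variant and the anchored regex treat them as non-Latin.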
Test
Pattern latinPattern = Pattern.compile("^[\\p{Print}\\p{IsLatin}]*$");
String[] inputs = { "abcDE 123", "!@#$%^&*", "aaàààäää", "ベビードラ", "😀😃😄😆" };
for (String input : inputs) {
    System.out.print("\"" + input + "\": ");
    Matcher matcher = latinPattern.matcher(input);
    if (!matcher.find()) {
        System.out.println("is NON latin");
    } else {
        System.out.println("is latin");
    }
}
Output
"abcDE 123": is latin
"!@#$%^&*": is latin
"aaàààäää": is latin
"ベビードラ": is NON latin
"😀😃😄😆": is NON latin
Upvotes: 4