I am trying to use a regex such as [ăâîșțĂÂÎȘȚ] to match Romanian diacritics (ISO 8859-16/Windows-1250). The problem is that the regex also matches a, i, s, t, A, I, S, T (the plain Latin letters corresponding to those diacritics), and I don't want that. I didn't try comparing the strings character by character because of the performance cost. Is there any way to make the regex match exactly these characters?

Java regex matches diacritics for the Latin corresponding characters


As already mentioned, Unicode offers two alternative representations of such a character.

'\u0061'    'a'   LATIN SMALL LETTER A
'\u0300'     ̀     COMBINING GRAVE ACCENT

or

'\u00E0'    'à'   LATIN SMALL LETTER A WITH GRAVE

There is a Normalizer that can "normalize" to either form (and deal with ligatures):

import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;
String regex = "(?u)[ăâîșțĂÂÎȘȚ]";
regex = Normalizer.normalize(regex, Form.NFC); // Composed form
Pattern pattern = Pattern.compile(regex);

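Using "(?u)" (the Pattern.UNICODE_CASE flag, also available as a flag argument to Pattern.compile) might already help, although that flag only affects case-insensitive matching. Writing the letters as precomposed Unicode characters, with no separate Latin base letter ('a'), will certainly do.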

Above all, the same normalization should be applied to the string being searched.
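
A minimal sketch of that advice, assuming both the pattern and the searched string are brought to NFC before matching. The class name RomanianDiacritics and the helper containsDiacritic are made up for illustration, and the letters are written as escapes so the source file encoding cannot change them.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;

final class RomanianDiacritics {
    // The ten Romanian letters (a-breve, a-circumflex, i-circumflex, s-comma,
    // t-comma, lower and upper case), normalized to the composed form (NFC).
    private static final Pattern DIACRITIC = Pattern.compile(Normalizer.normalize(
            "[\u0103\u00E2\u00EE\u0219\u021B\u0102\u00C2\u00CE\u0218\u021A]", Form.NFC));

    // Hypothetical helper: normalizes the input to NFC, then searches it.
    static boolean containsDiacritic(String input) {
        String composed = Normalizer.normalize(input, Form.NFC);
        return DIACRITIC.matcher(composed).find();
    }

    public static void main(String[] args) {
        System.out.println(containsDiacritic("\u021Bar\u0103")); // precomposed input
        System.out.println(containsDiacritic("t\u0326ara"));     // t + COMBINING COMMA BELOW
        System.out.println(containsDiacritic("tara"));           // plain Latin letters only
    }
}
```

Because the input is composed first, a decomposed t + comma below reaches the matcher as the single code point U+021B, while a bare t stays a bare t and is not in the class.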

Upvotes: 0

user557597


If your regex is written as literal rendered text, each letter has already been composed
and should exist as its own single code point.

000074    t    LATIN SMALL LETTER T
+
000326    ̦    COMBINING COMMA BELOW
=
00021B    ț    LATIN SMALL LETTER T WITH COMMA BELOW

Just in case, you should use a hex code point to represent them, i.e. \u021B

Is it possible that the Java engine is stripping the combining character off of the regex,
so that U+021B becomes U+0074? It might be that.

Meanwhile, if you expect the letters in the source to be decomposed (base letter plus
combining mark), you could use a regex like \p{Script=Latin}\p{Block=Combining_Diacritical_Marks}
to match those.
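
A rough sketch of that idea in Java, using the Pattern shorthands \p{IsLatin} and \p{InCombiningDiacriticalMarks} in place of the longer property names above. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class DecomposedLetters {
    // A Latin-script letter followed by one or more marks from the
    // Combining Diacritical Marks block (U+0300..U+036F).
    private static final Pattern BASE_PLUS_MARK =
            Pattern.compile("\\p{IsLatin}\\p{InCombiningDiacriticalMarks}+");

    // Hypothetical helper: true if the input holds a decomposed accented letter.
    static boolean hasDecomposedLetter(String input) {
        return BASE_PLUS_MARK.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(hasDecomposedLetter("t\u0326")); // t + COMBINING COMMA BELOW
        System.out.println(hasDecomposedLetter("\u021B"));  // precomposed letter, no mark follows
        System.out.println(hasDecomposedLetter("t"));       // bare letter
    }
}
```

This only finds the decomposed spelling; a precomposed letter has no combining mark after it and needs its own pattern.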

Updated info:
While searching around for a de facto solution, I came across this Java info
from http://www.regular-expressions.info/unicode.html.

In Java, the regex token \uFFFF only matches the specified code point, even when you turned on canonical equivalence. However, the same syntax \uFFFF is also used to insert Unicode characters into literal strings in the Java source code. Pattern.compile("\u00E0") will match both the single-code-point and double-code-point encodings of à, while Pattern.compile("\\u00E0") matches only the single-code-point version. Remember that when writing a regex as a Java string literal, backslashes must be escaped. The former Java code compiles the regex à, while the latter compiles \u00E0. Depending on what you're doing, the difference may be significant.

So, when a literal that has two possible encodings is placed inside a character class, it looks like Pattern.compile("[à]")
will actually match

000061    a    LATIN SMALL LETTER A
or
000300    ̀    COMBINING GRAVE ACCENT
or
0000E0    à    LATIN SMALL LETTER A WITH GRAVE

This smacks of the same problem as putting surrogate pairs inside character classes.
There is a solution.

Avoid entering those literals inside of a class.
Instead, put them as a series of alternations
(?:à|_|_|_)

Doing this forces it to match either

000061    a    LATIN SMALL LETTER A
000300    ̀    COMBINING GRAVE ACCENT

or

0000E0    à    LATIN SMALL LETTER A WITH GRAVE

It won't match a bare a without the grave, which is what you are seeing now.

Note: if you just use "[\\u00E0]", you'd miss the a + grave sequence,
which is also valid.
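
One way the alternation could be built for the letters in the question, as a sketch only: each letter goes in twice, once composed (NFC) and once decomposed (NFD). The class name DiacriticAlternation and the build method are made up for illustration.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.StringJoiner;
import java.util.regex.Pattern;

final class DiacriticAlternation {
    // a-breve, a-circumflex, i-circumflex, s-comma, t-comma, lower then upper case.
    private static final String LETTERS =
            "\u0103\u00E2\u00EE\u0219\u021B\u0102\u00C2\u00CE\u0218\u021A";

    // Builds (?:X|Y|...) with every letter in composed and decomposed form.
    static Pattern build() {
        StringJoiner alternatives = new StringJoiner("|", "(?:", ")");
        for (char c : LETTERS.toCharArray()) {
            String letter = String.valueOf(c);
            alternatives.add(Pattern.quote(Normalizer.normalize(letter, Form.NFC)));
            alternatives.add(Pattern.quote(Normalizer.normalize(letter, Form.NFD)));
        }
        return Pattern.compile(alternatives.toString());
    }

    public static void main(String[] args) {
        Pattern p = build();
        System.out.println(p.matcher("\u021B").find());  // composed t with comma below
        System.out.println(p.matcher("t\u0326").find()); // decomposed t with comma below
        System.out.println(p.matcher("t").find());       // bare t
    }
}
```

Since each alternative is a whole sequence, the base letter can only match together with its combining mark, never on its own.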

Upvotes: 2

jakeehoffmann


I believe this is happening because those characters are being treated as two Unicode code points. I would recommend trying to specifically match the code points using syntax like \uFFFF where FFFF is the code point. The exact syntax will depend on the regex implementation you are using.

Keep in mind that Unicode characters can be encoded as single code points or as multiple, so you'll want to account for that. Example: à encoded as U+0061 U+0300 and also U+00E0.
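
For Java, a sketch of what matching by code point could look like. The doubled backslashes are there so the regex engine, not the Java compiler, sees the escapes; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class CodePointClass {
    // Character class of the ten precomposed Romanian letters, given as
    // regex-level code point escapes (note the doubled backslashes).
    private static final Pattern PRECOMPOSED = Pattern.compile(
            "[\\u0103\\u00E2\\u00EE\\u0219\\u021B\\u0102\\u00C2\\u00CE\\u0218\\u021A]");

    // Hypothetical helper: true if the whole string is exactly one of those letters.
    static boolean isDiacriticLetter(String s) {
        return PRECOMPOSED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDiacriticLetter("\u021B"));  // precomposed code point
        System.out.println(isDiacriticLetter("t"));       // bare Latin letter
        System.out.println(isDiacriticLetter("t\u0326")); // decomposed form, two code points
    }
}
```

On its own this pattern does not cover the decomposed encoding, so input in that form would need to be normalized first or matched with a separate alternative.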

I hope this helps!

Upvotes: 0
