Reputation: 175
I've been trying to split a string by another string using the Regex.Split() method in C#.
Either the data or the splitter can contain diacritics, and the match should ignore them.
Let me give you an example:
Data: education
Splitter: dù
Expected result: e / du / cation
--or--
Data: èdùcation
Splitter: ed
Expected result: èd / ùcation
Is this possible? If so, could you help me write the pattern?
Upvotes: 0
Views: 151
Reputation: 109045
There is no option in .NET's regular expression engine to "ignore diacritics". However, it might be possible to work around this by making use of Unicode Normalization Form D ("D" for "decomposed"). This is untested.
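Accented characters can be represented in two ways: as a single precomposed code point (for example, ù as U+00F9), or as a base character followed by one or more combining marks (u followed by U+0300, COMBINING GRAVE ACCENT).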
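To make the two representations concrete, here is a minimal sketch that builds ù both ways and converts between them with String.Normalize. The class name NormalizationDemo is only an illustration.

```csharp
using System;
using System.Text;

class NormalizationDemo
{
    static void Main()
    {
        string composed = "\u00F9";     // ù as one precomposed code point
        string decomposed = "u\u0300";  // u followed by COMBINING GRAVE ACCENT

        // The two strings render the same but differ in length and content.
        Console.WriteLine(composed.Length);        // 1
        Console.WriteLine(decomposed.Length);      // 2
        Console.WriteLine(composed == decomposed); // False (== is an ordinal comparison)

        // Normalizing converts one representation into the other.
        Console.WriteLine(composed.Normalize(NormalizationForm.FormD) == decomposed); // True
        Console.WriteLine(decomposed.Normalize(NormalizationForm.FormC) == composed); // True
    }
}
```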
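Thus, if you ensure the input data is decomposed (call String.Normalize(), passing NormalizationForm.FormD) and replace every potentially accented character in the pattern with B\p{M}*, that is, a base character B followed by zero or more code points in the Unicode "Mark" category, the pattern should match the text with or without accents. Note that \p{Mc} ("Mark, Spacing Combining") on its own is too narrow: common combining accents such as U+0300 are in \p{Mn} ("Mark, Nonspacing"), which \p{M} includes.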
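As a rough sketch of the pattern rewrite, the helper below decomposes the splitter, drops its own combining marks, and appends \p{M}* after each remaining base character. DiacriticPattern.Build is a hypothetical name, not a .NET API, and the sketch assumes the data is normalized to Form D before matching.

```csharp
using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

static class DiacriticPattern
{
    // Hypothetical helper: turns a splitter such as "dù" into d\p{M}*u\p{M}*.
    public static string Build(string splitter)
    {
        var sb = new StringBuilder();
        foreach (char c in splitter.Normalize(NormalizationForm.FormD))
        {
            UnicodeCategory cat = char.GetUnicodeCategory(c);
            // Skip the splitter's own combining marks so only base characters remain.
            if (cat == UnicodeCategory.NonSpacingMark ||
                cat == UnicodeCategory.SpacingCombiningMark ||
                cat == UnicodeCategory.EnclosingMark)
            {
                continue;
            }
            // Escape the base character, then allow any number of marks after it.
            sb.Append(Regex.Escape(c.ToString())).Append(@"\p{M}*");
        }
        return sb.ToString();
    }

    static void Main()
    {
        string pattern = Build("dù");
        Console.WriteLine(pattern); // d\p{M}*u\p{M}*

        // The data must be decomposed too before it is matched.
        string data = "education".Normalize(NormalizationForm.FormD);
        Console.WriteLine(Regex.IsMatch(data, pattern)); // True
    }
}
```

Characters with no canonical decomposition (ø, for instance) are left unchanged by Form D, so this approach would not treat them as an accented o.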
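To include the text that matches the regex in the output, make the pattern capturing: Regex.Split() returns the text matched by capturing groups along with the split pieces. So, to match and capture both du and dù in decomposed input, use (d\p{M}*u\p{M}*), where \p{M} matches any combining mark.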
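Putting it together for the two examples in the question, a sketch along these lines might work. SplitKeepingDelimiter is a hypothetical helper; it assumes the pattern it receives contains no capturing groups of its own, and it recomposes the pieces to Form C for display.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

static class SplitDemo
{
    // Hypothetical helper: splits decomposed data and keeps the matched delimiter.
    public static string[] SplitKeepingDelimiter(string data, string pattern)
    {
        string decomposed = data.Normalize(NormalizationForm.FormD);
        return Regex.Split(decomposed, "(" + pattern + ")")
                    // A match at the very start or end yields an empty string; drop it.
                    .Where(part => part.Length > 0)
                    // Recompose each piece so accents are single code points again.
                    .Select(part => part.Normalize(NormalizationForm.FormC))
                    .ToArray();
    }

    static void Main()
    {
        // Expected: e / du / cation
        Console.WriteLine(string.Join(" / ",
            SplitKeepingDelimiter("education", @"d\p{M}*u\p{M}*")));

        // Expected: èd / ùcation
        Console.WriteLine(string.Join(" / ",
            SplitKeepingDelimiter("èdùcation", @"e\p{M}*d\p{M}*")));
    }
}
```

If case should be ignored as well, RegexOptions.IgnoreCase can be passed to Regex.Split().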
Upvotes: 1