kizanlik

Reputation: 175

RegEx: Split a string by another string, including diacritics

I've been trying to split a string by another string using the Regex.Split() method in C#. Either the data or the splitter can contain diacritics.

Let me give you an example:

Data: education

Splitter:

Expected result: e / du / cation

--or--

Data: èdùcation

Splitter: ed

Expected result: èd / ùcation

Is it possible? If so, could you help me write the pattern?

Upvotes: 0

Views: 151

Answers (1)

Richard

Reputation: 109045

There is no option in .NET's regular expression engine to "ignore diacritics"; however, it might be possible to work around that by making use of Unicode normalization form D (for "decomposed"). This is untested.

Accented characters can be represented in two ways:

  • As a single pre-composed code point, e.g. U+00F9 (Latin Small Letter U with Grave).
  • As a base code point followed by one or more combining characters, e.g. U+0075, U+0300 (Latin Small Letter U, Combining Grave Accent).
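To make the two representations concrete, here is a minimal sketch that converts between them with String.Normalize. The class and method names are hypothetical and only there for illustration.

```csharp
using System;
using System.Text;

public static class NormalizationDemo
{
    // Hypothetical helper: returns the decomposed (form D) version of a string.
    public static string Decompose(string s) => s.Normalize(NormalizationForm.FormD);

    public static void Main()
    {
        string composed = "\u00F9";               // pre-composed u with grave
        string decomposed = Decompose(composed);  // base letter + combining accent

        Console.WriteLine(composed.Length);             // 1
        Console.WriteLine(decomposed.Length);           // 2
        Console.WriteLine(decomposed == "u\u0300");     // True (ordinal comparison)
        Console.WriteLine(composed == decomposed);      // False: different code points

        // Form C recomposes the pair into the single code point again.
        Console.WriteLine(decomposed.Normalize(NormalizationForm.FormC) == composed); // True
    }
}
```

Both strings render identically, which is why a plain regex written against one form can silently fail to match the other.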

Thus, if you ensure the input data is decomposed (call String.Normalize, passing NormalizationForm.FormD), and replace every potentially accented character in the pattern with

B\p{M}*

(a base character B followed by zero or more code points in the Unicode category "Mark", which covers nonspacing combining marks such as U+0300), the pattern should match the text with or without accents.

To include the text that matches the regex in the output, make the pattern capturing. For example, to match and capture du with or without accents, use (d\p{M}*u\p{M}*).
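Putting the pieces together, the approach could be sketched roughly as follows. This is an illustration under assumptions, not a verified solution: DiacriticSplit, BuildPattern and Split are hypothetical names, the sample splitter "dù" is my own choice (the question's first splitter value is not shown), and the per-char loop assumes characters in the Basic Multilingual Plane.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class DiacriticSplit
{
    // Builds a capturing pattern in which every base character of the
    // splitter may be followed by any number of combining marks.
    public static string BuildPattern(string splitter)
    {
        var sb = new StringBuilder("(");
        foreach (char c in splitter.Normalize(NormalizationForm.FormD))
        {
            UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(c);
            // Skip the splitter's own marks so that accents become optional.
            if (cat == UnicodeCategory.NonSpacingMark ||
                cat == UnicodeCategory.SpacingCombiningMark ||
                cat == UnicodeCategory.EnclosingMark)
                continue;
            sb.Append(Regex.Escape(c.ToString())).Append(@"\p{M}*");
        }
        return sb.Append(')').ToString();
    }

    // Splits decomposed data on the pattern, drops empty pieces,
    // and recomposes each piece to form C for output.
    public static string[] Split(string data, string splitter)
    {
        string decomposed = data.Normalize(NormalizationForm.FormD);
        return Regex.Split(decomposed, BuildPattern(splitter))
                    .Where(p => p.Length > 0)
                    .Select(p => p.Normalize(NormalizationForm.FormC))
                    .ToArray();
    }

    public static void Main()
    {
        // Splitter "dù" written with an escape to avoid source-encoding issues.
        Console.WriteLine(string.Join(" / ", Split("education", "d\u00F9")));
        // prints: e / du / cation

        Console.WriteLine(string.Join(" / ", Split("\u00E8d\u00F9cation", "ed")));
        // prints: èd / ùcation
    }
}
```

When the match sits at the very start of the data, Regex.Split returns an empty leading piece; the sketch filters those out, which may or may not be what you want.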

Upvotes: 1
