Vlad Holubiev

Reputation: 5154

RegEx for Ukrainian letters. How to split Cyrillic words by capital letter?

I have a string with some Cyrillic words inside. Each word starts with a capital letter.

var str = 'ХєлпМіПліз';

I found this solution: str.match(/[А-Я][а-я]+/g).

But it returns ["Пл"] instead of ["Хєлп", "Мі", "Пліз"]. It seems to recognize only Russian letters, not Ukrainian ones ('і', 'є').

So how should I change the regex to include Ukrainian letters?

Upvotes: 11

Views: 15261

Answers (8)

doogan

Reputation: 11

Try the pattern below:

^[А-ЩЬЮЯҐЄІЇ][а-щьюяґєії']*$
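Because of the ^ and $ anchors, this pattern validates a single capitalized word rather than splitting a string. A minimal sketch of the difference, assuming JavaScript as in the question and a source file saved as UTF-8; the variable names are illustrative, and the unanchored variant is an adaptation, not part of the answer:

```javascript
// Pattern as given: checks that the whole input is one capitalized Ukrainian word.
const singleWord = /^[А-ЩЬЮЯҐЄІЇ][а-щьюяґєії']*$/;

// Assumed adaptation: same character classes, no anchors, g flag to find every word.
const everyWord = /[А-ЩЬЮЯҐЄІЇ][а-щьюяґєії']*/g;

const str = 'ХєлпМіПліз';

const isSingleWord = singleWord.test('Пліз'); // true
const isWholeString = singleWord.test(str);   // false: capitals appear mid-string
const words = str.match(everyWord);           // ["Хєлп", "Мі", "Пліз"]

console.log(isSingleWord, isWholeString, words);
```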

Upvotes: 0

Oshchenkov

Reputation: 31

Only Ukrainian, without Russian

/[бвгґджзклмнпрстфхцчшщйаеєиіїоуюяь]/gi

Upvotes: 3

DL-Newbie

Reputation: 146

works with Ukrainian letters 'i' and others

python
r's/[^а-яА-Я.!?]/./g+' 

Upvotes: 2

daubmannus

Reputation: 518

[А-Я] is not the Cyrillic alphabet, it's just Russian!

Cyrillic is a writing system. It is used in the alphabets of many languages. (Like Latin: a character set for West European languages, East European ones, etc.)

To cover both Russian and Ukrainian, use [А-ЯҐЄІЇ].

To add Belarusian: [А-ЯҐЄІЇЎ]

For all Cyrillic characters (including those of Balkan languages and Old Cyrillic), you can use a Unicode subset class such as \p{IsCyrillic}.
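Applied to the string from the question, the extended class can be sketched as follows. The lowercase counterpart [а-яґєії] is an assumption mirroring the uppercase class given above, and the variable names are illustrative:

```javascript
const str = 'ХєлпМіПліз';

// Russian-only ranges: є and і fall outside а-я, so most words are lost.
const russianOnly = str.match(/[А-Я][а-я]+/g);

// Ranges extended with the four Ukrainian letters in both cases.
const withUkrainian = str.match(/[А-ЯҐЄІЇ][а-яґєії]+/g);

console.log(russianOnly);   // ["Пл"]
console.log(withUkrainian); // ["Хєлп", "Мі", "Пліз"]
```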


To deal with Ukrainian separately:

[А-ЩЬЮЯҐЄІЇ] or [А-ЩЬЮЯҐЄІЇа-щьюяґєії] appears to cover the full Ukrainian alphabet: 33 letters in each case.
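One way to check the letter count is to test every letter of the alphabet against the class. This is a rough sketch: the alphabet string is typed out by hand here, and the names are illustrative.

```javascript
// The 33 uppercase letters of the Ukrainian alphabet, in alphabetical order.
const upper = 'АБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ';
const ukrainianUpper = /^[А-ЩЬЮЯҐЄІЇ]$/;

const count = [...upper].length;
const allMatch = [...upper].every((ch) => ukrainianUpper.test(ch));

// Letters used in Russian but not in Ukrainian should be rejected.
const russianLeaks = [...'ЁЪЫЭ'].filter((ch) => ukrainianUpper.test(ch));

console.log(count, allMatch, russianLeaks); // 33 true []
```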

The apostrophe is not a letter, but it is occasionally included in the alphabet because it affects how the following vowel is pronounced. The apostrophe is part of the word, not a separator. It may be written in several ways:

U+0027 "'" APOSTROPHE
U+0060 "`" GRAVE ACCENT
U+2019 "’" RIGHT SINGLE QUOTATION MARK
U+02BC "ʼ" MODIFIER LETTER APOSTROPHE

and maybe some more.

Yes, the apostrophe makes things a bit complicated. There is no common standard for it.
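A possible way to handle these variants is to allow any of the four characters listed above between letters, then normalize them afterwards. This is only a sketch: the sample words, the variable names, and the choice of U+0027 as the normalized form are assumptions.

```javascript
// The four apostrophe variants listed above.
const apostrophes = "'`\u2019\u02BC";
const letters = 'А-ЩЬЮЯҐЄІЇа-щьюяґєії';

// A word is a run of letters, optionally joined by single apostrophes.
const word = new RegExp(`[${letters}]+(?:[${apostrophes}][${letters}]+)*`, 'g');

const text = "з'їсти, п\u2019ять, м\u02BCята";
const found = text.match(word);

// Assumed normalization: replace every variant with the plain ASCII apostrophe.
const normalized = found.map((w) => w.replace(/[`\u2019\u02BC]/g, "'"));

console.log(found);
console.log(normalized);
```

Requiring letters on both sides of the apostrophe keeps a stray quotation mark at the start or end of a word out of the match.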

Upvotes: 38

Purkhalo Alex

Reputation: 3627

The Ukrainian alphabet has four letters that fall outside the а-я range: [і, є, ї, ґ]. Words can also contain a single quote (apostrophe).

"ґуля, з'їсти, істота, Європа".match(/[а-яієїґ\']+/ig)

The i flag makes the match case-insensitive, so it also matches uppercase letters, as in "Європа".

Upvotes: 9

Slavkó Medvediev

Reputation: 1601

Use \p{Lu} to match uppercase letters, \p{Ll} for lowercase, or \p{L} to match any letter.

Update: when this was written, that worked only in Java, not in JavaScript. Don't forget to include the apostrophe and "ї" in your regexp.
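JavaScript engines that implement ES2018 Unicode property escapes accept the same classes when the regex carries the u flag. A minimal sketch, assuming such an engine (for example a recent Node.js or browser); the variable names are illustrative:

```javascript
const str = 'ХєлпМіПліз';

// \p{Lu} = any uppercase letter, \p{Ll} = any lowercase letter (u flag required).
const words = str.match(/\p{Lu}\p{Ll}+/gu);

// Script check: true only if every character belongs to the Cyrillic script.
const isAllCyrillic = /^\p{Script=Cyrillic}+$/u.test(str);

console.log(words);         // ["Хєлп", "Мі", "Пліз"]
console.log(isAllCyrillic); // true
```

Note that \p{Lu} and \p{Ll} match letters of every script, not only Cyrillic, so mixed-script input may need the script check as well.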

Upvotes: 12

ProdoElmit

Reputation: 1067

[А-Я][а-я] really doesn't include Ukrainian letters.

While 'я' is \u044f, 'є' is \u0454 and 'і' is \u0456 (\u0404 for Є and \u0406 for І). You should add them to the regex by hand:

/[А-ЯЄІ][а-яєі]+/g
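The code points can be inspected directly to see which letters the а-я range leaves out. A small sketch; hex is a hypothetical helper, and the letter list is chosen for illustration:

```javascript
// Hypothetical helper: four-digit hex code point of a single character.
const hex = (ch) => ch.codePointAt(0).toString(16).padStart(4, '0');

const letters = ['а', 'я', 'є', 'і', 'ї', 'ґ'];
const codes = letters.map(hex);

// The range а-я spans U+0430..U+044F, so only letters inside it pass.
const insideRange = letters.filter((ch) => /[а-я]/.test(ch));

console.log(codes);       // ["0430", "044f", "0454", "0456", "0457", "0491"]
console.log(insideRange); // ["а", "я"]
```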

Upvotes: 4

Casimir et Hippolyte

Reputation: 89574

The way to solve this is to look at the Unicode table to determine the character ranges you need. If, for example, I use the pattern:

str.match(/[А-Я][а-яєі]+/g)

it works with your example string. (Sorry, I don't know the Ukrainian letters.)

Upvotes: 4
