Reputation: 693
Sorry for the ambiguous subject. I want to take a string with Cyrillic characters such as
«Добрый день!» - сказал он, потянувшись…
and split it into an array like
[0] => «
[1] => Добрый␠
[2] => день!»␠-␠
[3] => сказал␠
[4] => он,␠
[5] => потянувшись…
So essentially I want a break at the border between any character and a Cyrillic character (the [а-я] range), but only on the transition from a non-Cyrillic character to a Cyrillic one, not vice versa. I've seen examples that solve a similar problem for punctuation characters and the Latin alphabet with
preg_split('/([^.:!?]+[.:!?]+)/', 'hello:there.everyone!so.how?are:you', NULL, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY );
but my attempts to repurpose it into something different have so far failed:
preg_split ('/(?<=[^а-я])/ius', $text, NULL, PREG_SPLIT_NO_EMPTY);
almost works, but it also splits after ordinary characters such as spaces and punctuation marks, which is not what I want. Clearly there's something wrong with my regex. How should I modify it to get the result shown in the example above?
Upvotes: 2
Views: 1043
Reputation: 18535
How about splitting at an initial word boundary \b with the u modifier?
$res = preg_split('/\b(?=\w)(?!^)/u', $str);
The lookahead (?=\w) ensures that \b is followed by a word character, and (?!^) prevents an empty match at the start of the string.
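A minimal sketch of how this behaves on the sample string from the question (the variable names are illustrative only). Note that \w is not limited to Cyrillic: with the u modifier it also matches Latin letters and digits, so this pattern breaks before any word, which may or may not be what you want.

```php
<?php
// Sample string from the question.
$str = '«Добрый день!» - сказал он, потянувшись…';

// Split at every word start except the very beginning of the string.
$res = preg_split('/\b(?=\w)(?!^)/u', $str);
print_r($res);

// \w is not Cyrillic-specific: Latin words and digits start a new piece too.
$mixed = preg_split('/\b(?=\w)(?!^)/u', 'abc где 123');
print_r($mixed);
```

If only Cyrillic words should start a new piece, the lookahead would need a Cyrillic-only class instead of \w.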
Upvotes: 2
Reputation: 2582
You also have to check, with a lookahead, that the next character is a Cyrillic one. This code will do the job:
$t = preg_split('/(?<=[^а-я])(?=[а-я]+)/ius', $text, -1, PREG_SPLIT_NO_EMPTY);
It gives this output:
Array
(
[0] => «
[1] => Добрый
[2] => день!» -
[3] => сказал
[4] => он,
[5] => потянувшись…
)
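One caveat: the range [а-я] does not include ё/Ё, so a word containing that letter can be cut in the middle. Here is a hedged sketch of the same lookaround idea using the \p{Cyrillic} script property instead. The helper name splitBeforeCyrillic is made up for illustration, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper: split wherever a non-Cyrillic character is
// immediately followed by a Cyrillic one. Both lookarounds are zero-width,
// so no characters are consumed by the split.
// Assumes valid UTF-8 input (preg_split returns false otherwise).
function splitBeforeCyrillic(string $text): array
{
    return preg_split(
        '/(?<=\P{Cyrillic})(?=\p{Cyrillic})/u',
        $text,
        -1,
        PREG_SPLIT_NO_EMPTY
    );
}

$sample = splitBeforeCyrillic('«Добрый день!» - сказал он, потянувшись…');
$withYo = splitBeforeCyrillic('«Ёлка и ёж»');
print_r($sample);
print_r($withYo);
```

Unlike the \b-based patterns, this only reacts to a script change, so Latin words and digits stay attached to the preceding piece.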
Upvotes: 1
Reputation: 627103
Use the following regex solution:
$s = "«Добрый день!» - сказал он, потянувшись…";
$res = preg_split('/\b(\p{Cyrillic}+\W*)/u', $s, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
print_r($res);
// Array(
// [0] => «
// [1] => Добрый
// [2] => день!» -
// [3] => сказал
// [4] => он,
// [5] => потянувшись…
//)
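To see what the flags contribute, here is a small sketch that runs the same pattern with and without PREG_SPLIT_DELIM_CAPTURE (the variable names are illustrative). Because the pattern consumes every Cyrillic word, only the text between matches survives when that flag is missing.

```php
<?php
$s = '«Добрый день!» - сказал он, потянувшись…';
$pattern = '/\b(\p{Cyrillic}+\W*)/u';

// Without DELIM_CAPTURE the matched words act as delimiters and are dropped.
$withoutCapture = preg_split($pattern, $s, -1, PREG_SPLIT_NO_EMPTY);

// With DELIM_CAPTURE the captured group is returned as part of the result.
$withCapture = preg_split($pattern, $s, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

print_r($withoutCapture);
print_r($withCapture);
```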
Details:
- \b(\p{Cyrillic}+\W*) matches and captures a whole Cyrillic word with 0+ non-word chars after it
- PREG_SPLIT_DELIM_CAPTURE will push the captured values into the resulting array
- PREG_SPLIT_NO_EMPTY will discard empty values in the array
- The /u modifier will make \b (word boundary) and \W Unicode aware, and will allow processing Unicode strings with regex.
Upvotes: 2
Reputation: 7111
Try this regex: [\x{0400}-\x{04FF}]*[^\x{0400}-\x{04FF}]* (it needs the u modifier). All Unicode code points from U+0400 to U+04FF belong to the Cyrillic block. It should match exactly what you want. You can also replace \x{0400}-\x{04FF} with \p{Cyrillic}, as suggested in another answer.
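A rough sketch of how that pattern could be used, assuming it is meant for preg_match_all rather than preg_split, since it matches the pieces themselves (the variable names are illustrative). Because both halves of the pattern are optional, it also produces a zero-length match at the end of the string, which is filtered out here.

```php
<?php
$text = '«Добрый день!» - сказал он, потянувшись…';

// Each match: an optional run of Cyrillic characters,
// then an optional run of anything else.
preg_match_all('/[\x{0400}-\x{04FF}]*[^\x{0400}-\x{04FF}]*/u', $text, $matches);

// Drop the zero-length match produced at the end of the string.
$pieces = array_values(array_filter($matches[0], fn($piece) => $piece !== ''));
print_r($pieces);
```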
These are all the characters in that range:
ЀЁЂЃЄЅІЇЈЉЊЋЌЍЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюяѐёђѓєѕіїјљњћќѝўџѠѡѢѣѤѥѦѧѨѩѪѫѬѭѮѯѰѱѲѳѴѵѶѷѸѹѺѻѼѽѾѿҀҁ҂҃҄҅҆҇҈҉ҊҋҌҍҎҏҐґҒғҔҕҖҗҘҙҚқҜҝҞҟҠҡҢңҤҥҦҧҨҩҪҫҬҭҮүҰұҲҳҴҵҶҷҸҹҺһҼҽҾҿӀӁӂӃӄӅӆӇӈӉӊӋӌӍӎӏӐӑӒӓӔӕӖӗӘәӚӛӜӝӞӟӠӡӢӣӤӥӦӧӨөӪӫӬӭӮӯӰӱӲӳӴӵӶӷӸӹӺӻӼӽӾӿ
Upvotes: 0