Andreas Hunter

Reputation: 5024

How to split text into English-Cyrillic and Cyrillic-English sentence pairs?

I have sample text that mixes English and Cyrillic sentences:

“No,” the  old  man  said.” But we have .Haven’t we?” Бале , -гуфт  -Аммо мо бовар дорем . Дуруст”?  
“Yes ,”the boy said . Can I offer you a  beer on the  Terrace and then we’ll take the stuff home . 

 Албатта . Мехоҳӣ, ки дар каҳвахона  бароят оби ҷав  бигирам?  Баъд чизҳоро  ба хона  мебарем .  

“Why not ?”  the  old man said . “  Between fishermen.”  
Чаро  не ?! гуфт  пирамард .- Моҳигир моҳигириро метавонад  даъват кунад.

How can I turn this text into an array like this?

$englishCyrillic = [
   "No, the  old  man  said. But we have .Haven’t we?" => "Бале , -гуфт  -Аммо мо бовар дорем . Дуруст?",
   "Yes ,the boy said . Can I offer you a  beer on the  Terrace and then we’ll take the stuff home." => "Албатта . Мехоҳӣ, ки дар каҳвахона  бароят оби ҷав  бигирам?  Баъд чизҳоро  ба хона  мебарем.",
   "Why not ?  the  old man said . Between fishermen." => "Чаро  не ?! гуфт  пирамард .- Моҳигир моҳигириро метавонад  даъват кунад.",
];

I also have text where the Cyrillic sentence comes first and the English one follows:

Куҷо дард мекунад?  Show me where it hurts?    
Нафас гиред / Нафас нагиред.    Breath / Do not breath     
Чуқуртар нафас гиред    Breathe deeply

How can I get this result from that text?

$cyrillicEnglish = [
   "Куҷо дард мекунад?" => "Show me where it hurts?",
   "Нафас гиред / Нафас нагиред." => "Breath / Do not breath",
   "Чуқуртар нафас гиред" => "Breathe deeply",
];

I tried regex, but my code cannot split the text by sentence and return the result I need:

Search for English words:

preg_match_all('/[\p{Latin}]+/u', $text, $matches);

Search for Cyrillic words:

preg_match_all('/[\p{Cyrillic}]+/u', $text, $matches);

Upvotes: 1

Views: 75

Answers (1)

Wiktor Stribiżew

Reputation: 627082

The strings in the first format can be read line by line: add the odd-numbered lines as English and the even-numbered ones as Cyrillic. No regex is required.
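A minimal sketch of that line-by-line pairing, assuming the input has already been normalized so that non-empty lines strictly alternate (English first, then its Cyrillic translation). The function name `pairAlternatingLines` is hypothetical and used only for illustration:

```php
<?php
// Hypothetical helper: pairs alternating non-empty lines.
// Assumption: line 1 is English, line 2 is its Cyrillic translation, and so on.
function pairAlternatingLines(string $text): array
{
    // Split on any newline sequence, trim each line, and drop empty lines.
    $lines = array_map('trim', preg_split('/\R/u', $text));
    $lines = array_values(array_filter($lines, fn(string $l): bool => $l !== ''));

    $pairs = [];
    // Step through the lines two at a time; an unpaired trailing line is ignored.
    for ($i = 0; $i + 1 < count($lines); $i += 2) {
        $pairs[$lines[$i]] = $lines[$i + 1];
    }
    return $pairs;
}
```

If a single line contains both an English and a Cyrillic sentence, it would need to be split before this pairing can work.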

For the second format, you might use

preg_match_all('~(.*\p{Cyrillic}\S*)\h+(.+)~u', $s, $matches)

and then create the array:

array_combine($matches[1], $matches[2])
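Put together, the two steps might look like the sketch below. It assumes each line holds one Cyrillic phrase followed by horizontal whitespace and its English translation. The wrapper name `pairCyrillicEnglish` and the extra `trim` calls (to drop trailing spaces captured by `(.+)`) are additions for illustration, not part of the answer above:

```php
<?php
// Hypothetical wrapper around the regex from the answer.
function pairCyrillicEnglish(string $text): array
{
    // Group 1: everything up to the last word containing a Cyrillic letter.
    // Group 2: the rest of the line after the horizontal whitespace.
    preg_match_all('~(.*\p{Cyrillic}\S*)\h+(.+)~u', $text, $matches);

    // Trim stray spaces (and any "\r") from both sides of each capture.
    $keys = array_map('trim', $matches[1]);
    $values = array_map('trim', $matches[2]);

    // Cyrillic phrases become keys, English phrases become values.
    return array_combine($keys, $values);
}

$s = "Куҷо дард мекунад?  Show me where it hurts?    \n"
   . "Нафас гиред / Нафас нагиред.    Breath / Do not breath     \n"
   . "Чуқуртар нафас гиред    Breathe deeply";
$cyrillicEnglish = pairCyrillicEnglish($s);
```

Because `.` does not match a newline by default, each match stays within a single line, so no `m` or `s` flag is needed here.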

See the second regex demo

Upvotes: 1
