Reputation: 354

Split a string with preg_split into English and non-English parts

I want to split my sentence(s) into two parts: the English text and the non-English text. I have a regex that I am using with preg_split, which I expected to match the normal (ASCII) letters and characters. Instead it works the opposite way: I am left with only the Japanese and not the English.

String I work with:

すぐに諦めて昼寝をするかも知れない。  I may give up soon and just nap instead.

My attempt:

    $parts = preg_split("/[ -~]+$/", $cleanline); // $cleanline is the string above
    print_r($parts);

My result

Array ( [0] => すぐに諦めて昼寝をするかも知れない。   [1] => )

As you can see, the second value is empty. How can I get the English and the non-English text into two different strings? Why is the English text not returned, even though the regex looks correct in everything I have tested?

Upvotes: 2

Answers (3)

Toto

Reputation: 91430

You could use lookarounds to split on the boundary between non-alphabetic characters and alphabetic characters or horizontal whitespace:

$str = 'すぐに諦めて昼寝をするかも知れない。  I may give up soon and just nap instead.';
$parts = preg_split("/(?<=[^a-z])(?=[a-z\h])|(?<=[a-z\h])(?=[^a-z])/i", $str, 2);
print_r($parts);

Output:

Array
(
    [0] => すぐに諦めて昼寝をするかも知れない。
    [1] =>   I may give up soon and just nap instead.
)
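The pattern above runs without the `u` modifier, so PCRE compares single bytes rather than characters. A minimal sketch of the same lookaround split with `u` added and the pieces trimmed, assuming the input is valid UTF-8 (the `u` modifier and the `trim` step are additions for illustration, not part of the answer as posted):

```php
<?php
$str = 'すぐに諦めて昼寝をするかも知れない。  I may give up soon and just nap instead.';

// Same lookarounds as above; "u" makes PCRE treat the subject as UTF-8
// characters instead of single bytes. The limit of 2 stops after the
// first boundary, so the English sentence stays in one piece.
$pattern = '/(?<=[^a-z])(?=[a-z\h])|(?<=[a-z\h])(?=[^a-z])/iu';
$parts = preg_split($pattern, $str, 2);

// Drop the spaces that separated the two sentences.
$parts = array_map('trim', $parts);

print_r($parts);
```

With the trim applied, the second element no longer starts with the two separator spaces.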

Upvotes: 2

Ibrahim

Reputation: 6088

If you have two spaces between the two strings, as shown in your example, you can split them easily with a simple \s{2}:

<?php
$s = "すぐに諦めて昼寝をするかも知れない。  I may give up soon and just nap instead.";
$s = preg_split("/\s{2}/", $s);
print_r($s);
?>

Output:

Array
(
    [0] => すぐに諦めて昼寝をするかも知れない。
    [1] => I may give up soon and just nap instead.
)

Demo: http://ideone.com/uD2W1Q
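`\s{2}` matches every run of exactly two whitespace characters, so three spaces in a row would leave a stray space in the second piece, and a double space later in the English text would produce a third piece. A hedged variant, assuming the separator is the first run of two or more whitespace characters (the input below, with its extra spaces, is made up for illustration):

```php
<?php
// Hypothetical input: three spaces between the parts, and a double space later on.
$s = "すぐに諦めて昼寝をするかも知れない。   I may give up soon  and just nap instead.";

// \s{2,} consumes the whole run of whitespace; the limit of 2 keeps everything
// after the first separator in one piece. "u" treats the subject as UTF-8.
$parts = preg_split('/\s{2,}/u', $s, 2);

print_r($parts);
```

The limit argument is what keeps the later double space inside the English part instead of splitting on it.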

Upvotes: 2

Arif Acar

Reputation: 1571

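Try mb_split instead of preg_split. Note that mb_split takes the pattern without the enclosing / delimiters.

mb_regex_encoding('UTF-8');
mb_internal_encoding('UTF-8');
// mb_split() patterns are written without delimiters
$parts = mb_split('[ -~]+$', $cleanline);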
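
Whichever function does the splitting, `[ -~]+$` matches the trailing run of printable ASCII, which is the English sentence itself, and a split discards whatever the pattern matches. That is why only the Japanese part and an empty string come back. A minimal sketch that keeps the matched text by wrapping it in a capture group, assuming `preg_split` with the `PREG_SPLIT_DELIM_CAPTURE` and `PREG_SPLIT_NO_EMPTY` flags (a swapped-in technique, not the `mb_split` call above):

```php
<?php
$cleanline = 'すぐに諦めて昼寝をするかも知れない。  I may give up soon and just nap instead.';

// The parentheses capture the delimiter, and PREG_SPLIT_DELIM_CAPTURE returns it
// as an element of its own. PREG_SPLIT_NO_EMPTY drops the empty string that
// follows the delimiter at the end of the subject.
$parts = preg_split(
    '/([ -~]+)$/',
    $cleanline,
    -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);

// The captured English text still starts with the two separator spaces.
$parts = array_map('trim', $parts);

print_r($parts);
```

The `trim` call is only there to remove the separator spaces that were captured together with the English sentence.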

Upvotes: 2
