Daan

Reputation: 3348

preg_split in unicode mode: delim_capture not working?

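I'm trying to use a regex to split a chunk of Chinese text into sentences. For my purposes, sentence delimiters are the ideographic full stop (。, U+3002), the fullwidth exclamation mark (!, U+FF01), and the fullwidth question mark (?, U+FF1F).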

Now, let's say my $str is this: $str = "你好。你好吗? 我是程序员,不太懂这个我问题,希望大家能够帮忙!一起加油吧!";

I use preg_split with these parameters:

$str2 = preg_split("/([\x{3002}\x{FF01}\x{FF1F}])/u",$str,PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);

$str2 is now an array that looks like this:

array(3) { [0]=> string(6) "你好" [1]=> string(9) "你好吗" [2]=> string(91) " 我是程序员,不太懂这个我问题,希望大家能够帮忙!一起加油吧!" }

However, the expected output is:

[0] "你好。"
[1] "你好吗?"
[2] "我是程序员,不太懂这个我问题,希望大家能够帮忙!"
[3] "一起加油吧!"

As you can see, there are two problems: first, the exclamation marks are not treated as delimiters at all, and second, my fullwidth full stop and fullwidth question mark vanish. I'd expect PREG_SPLIT_DELIM_CAPTURE to keep them. I've been staring at this code for so long that I can't figure out what the problem is anymore. I would very much appreciate suggestions.

Upvotes: 3

Views: 1694

Answers (2)

anubhava

Reputation: 785246

Your regex code should be like this to be able to capture string + delimiter:

$str = "你好。你好吗? 我是程序员,不太懂这个我问题,希望大家能够帮忙!一起加油吧!";
$arr = preg_split("/\s*([^\x{3002}\x{FF01}\x{FF1F}]+[\x{3002}\x{FF01}\x{FF1F}]\s*)/u",
                  $str, 0, PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY );
var_dump($arr);

OUTPUT:

array(4) {
  [0]=> string(9)  "你好。"
  [1]=> string(13) "你好吗? "
  [2]=> string(72) "我是程序员,不太懂这个我问题,希望大家能够帮忙!"
  [3]=> string(18) "一起加油吧!"
}
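Note that element [1] in the output above still carries the trailing space from the input, which the expected output in the question does not have. A minimal sketch of one way to tidy that up, assuming plain ASCII whitespace and using the hypothetical variable name $sentences:

```php
<?php
// Sample text from the question (fullwidth punctuation, one ASCII space).
$str = "你好。你好吗? 我是程序员,不太懂这个我问题,希望大家能够帮忙!一起加油吧!";

// Same pattern as in the answer above; a $limit of 0 means "no limit".
$arr = preg_split(
    '/\s*([^\x{3002}\x{FF01}\x{FF1F}]+[\x{3002}\x{FF01}\x{FF1F}]\s*)/u',
    $str,
    0,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);

// trim() only strips single-byte ASCII whitespace, so it is safe on UTF-8.
$sentences = array_map('trim', $arr);
print_r($sentences);
```

trim() would not remove an ideographic space (U+3000), so text containing those would need a different cleanup step.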

Upvotes: 4

Wiseguy

Reputation: 20883

You're missing the $limit parameter to preg_split().

array preg_split ( string $pattern , string $subject [, int $limit = -1 [, int $flags = 0 ]] )

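As a result, you're passing PREG_SPLIT_DELIM_CAPTURE (2) + PREG_SPLIT_NO_EMPTY (1) = 3 as the $limit. That's why it's stopping at three pieces. And because $flags is left at its default of 0, neither flag is actually applied, which is why the delimiters are dropped instead of captured.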
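To see the effect in isolation, here is a small sketch with a made-up comma-separated string (not from the question) that passes 3 in the $limit slot:

```php
<?php
// The two flags are plain integers, so OR-ing them yields 3.
var_dump(PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY); // int(3)

// With 3 as $limit, at most three pieces come back, and the last
// piece holds the unsplit remainder of the subject.
$pieces = preg_split('/,/', 'a,b,c,d', 3);
print_r($pieces); // "a", "b", "c,d"
```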

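Pass -1 (the default, meaning no limit) as the $limit parameter, and you're in good shape. Older PHP versions also accept null there, but passing null to a non-nullable built-in parameter is deprecated as of PHP 8.1.

preg_split($pattern, $str, -1, PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY)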
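With the $limit slot filled, the original pattern returns each delimiter as its own array element instead of attached to its sentence. A rough sketch of gluing them back together to get the output the question asks for, using -1 (the documented default) as $limit; the loop and the $sentences name are illustrative assumptions, not part of the original answer:

```php
<?php
// Sample text from the question (fullwidth punctuation, one ASCII space).
$str = "你好。你好吗? 我是程序员,不太懂这个我问题,希望大家能够帮忙!一起加油吧!";
$delims = '[\x{3002}\x{FF01}\x{FF1F}]';

// -1 as $limit means "no limit", so the flags land in the $flags slot.
$parts = preg_split(
    "/($delims)/u",
    $str,
    -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);
// $parts now alternates text and delimiter: "你好", "。", "你好吗", "?", ...

$sentences = [];
foreach ($parts as $part) {
    if ($sentences && preg_match("/^$delims$/u", $part)) {
        // Attach the delimiter to the sentence that precedes it.
        $sentences[count($sentences) - 1] .= $part;
    } else {
        $text = ltrim($part); // drop the leading ASCII space, if any
        if ($text !== '') {
            $sentences[] = $text;
        }
    }
}
print_r($sentences);
```

For the sample string this yields the four sentences listed in the question, each ending in its own delimiter.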

Upvotes: 3
