Reputation: 3348
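I'm trying to use a regex to split a chunk of Chinese text into sentences. For my purposes, sentence delimiters are the fullwidth full stop (。, U+3002), the fullwidth exclamation mark (!, U+FF01), and the fullwidth question mark (?, U+FF1F).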
Now, let's say my $str is this:
$str = "你好。你好吗? 我是程序员,不太懂这个我问题,希望大家能够帮忙!一起加油吧!";
I use preg_split with these parameters:
$str2 = preg_split("/([\x{3002}\x{FF01}\x{FF1F}])/u",$str,PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);
$str2 is now an array that looks like this:
array(3) { [0]=> string(6) "你好" [1]=> string(9) "你好吗" [2]=> string(91) " 我是程序员,不太懂这个我问题,希望大家能够帮忙!一起加油吧!" }
However, the expected output is:
[0] "你好。"
[1] "你好吗?"
[2] "我是程序员,不太懂这个我问题,希望大家能够帮忙!"
[3] "一起加油吧!"
As you can see, there are two problems: first, the text is not split at the exclamation marks; second, my fullwidth full stop and fullwidth question mark vanish. I'd expect PREG_SPLIT_DELIM_CAPTURE to keep them. I've been staring at this code for so long that I can no longer see what the problem is. I would very much appreciate suggestions.
Upvotes: 3
Views: 1694
Reputation: 785246
Your regex should look like this so that each captured group holds the sentence text together with its delimiter:
$str = "你好。你好吗? 我是程序员,不太懂这个我问题,希望大家能够帮忙!一起加油吧!";
$arr = preg_split("/\s*([^\x{3002}\x{FF01}\x{FF1F}]+[\x{3002}\x{FF01}\x{FF1F}]\s*)/u",
$str, 0, PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY );
var_dump($arr);
OUTPUT:
array(4) {
[0]=> string(9) "你好。"
[1]=> string(13) "你好吗? "
[2]=> string(72) "我是程序员,不太懂这个我问题,希望大家能够帮忙!"
[3]=> string(18) "一起加油吧!"
}
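Note that element [1] above still carries the space that followed the question mark. If that trailing whitespace is unwanted, one possible variation (a sketch, not part of the answer above) is to split at the zero-width position right after a delimiter with a lookbehind and let the pattern consume any whitespace that follows. It assumes the source file is saved as UTF-8 and that the same three delimiters are the only sentence terminators.

```php
<?php
$str = "你好。你好吗? 我是程序员,不太懂这个我问题,希望大家能够帮忙!一起加油吧!";

// (?<=[...]) asserts that the previous character is one of the three delimiters;
// \s* then swallows optional whitespace, so it never ends up in a sentence.
// PREG_SPLIT_NO_EMPTY drops the empty piece after the final delimiter.
$sentences = preg_split(
    '/(?<=[\x{3002}\x{FF01}\x{FF1F}])\s*/u',
    $str,
    -1,
    PREG_SPLIT_NO_EMPTY
);

var_dump($sentences); // four strings, each ending in its own delimiter
```

Because the delimiter is only looked at and never matched, PREG_SPLIT_DELIM_CAPTURE is not needed here.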
Upvotes: 4
Reputation: 20883
You're missing the $limit parameter to preg_split().
array preg_split ( string $pattern , string $subject [, int $limit = -1 [, int $flags = 0 ]] )
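As a result, you're passing PREG_SPLIT_DELIM_CAPTURE (2) + PREG_SPLIT_NO_EMPTY (1) = 3 as the $limit. That's why it's stopping at three. It also leaves $flags at its default of 0, so PREG_SPLIT_DELIM_CAPTURE is never applied, which is why the delimiters vanish.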
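A small check of that arithmetic, as a sketch that reuses the string and pattern from the question (the variable names $flags, $asLimit and $explicit are only for illustration):

```php
<?php
$str = "你好。你好吗? 我是程序员,不太懂这个我问题,希望大家能够帮忙!一起加油吧!";
$pattern = '/([\x{3002}\x{FF01}\x{FF1F}])/u';

// The two flag constants combine to the integer 3.
$flags = PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY;
var_dump($flags); // int(3)

// Passed as the third argument, that integer is read as $limit, not $flags.
$asLimit  = preg_split($pattern, $str, $flags);
// Spelling out the same call: limit of 3, no flags at all.
$explicit = preg_split($pattern, $str, 3, 0);

var_dump($asLimit === $explicit); // bool(true)
var_dump(count($asLimit));        // int(3)
```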
Add null (or -1, the default shown in the signature above) as the $limit parameter, and you're in good shape.
preg_split($pattern, $str, null, PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY)
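With the limit in place, the delimiters come back as separate array elements rather than attached to their sentences. A minimal sketch of one way to glue them back together follows; the array_chunk pairing is an illustrative assumption, and it relies on the pieces strictly alternating between text and delimiter, which holds for this sample but not for input with consecutive delimiters such as "!!".

```php
<?php
$str = "你好。你好吗? 我是程序员,不太懂这个我问题,希望大家能够帮忙!一起加油吧!";
$pattern = '/([\x{3002}\x{FF01}\x{FF1F}])/u';

// -1 means "no limit"; the flags now land in the fourth argument.
$parts = preg_split($pattern, $str, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
// $parts alternates text and delimiter: "你好", "。", "你好吗", "?", ...

// Hypothetical post-processing: join each text piece with the delimiter after it,
// and trim the ASCII space left over at the start of the third sentence.
$sentences = [];
foreach (array_chunk($parts, 2) as $pair) {
    $sentences[] = trim(implode('', $pair));
}

print_r($sentences);
```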
Upvotes: 3