Reputation: 687
This question is a continuation of my previous question:
I want to split a paragraph string into an array of sentences, using a regular expression that splits on the dot character (.). The remaining problem is with numbers that contain a dot.
Here is an example:
In this year 2013. Hello Mr. Andre, your money is Rp 40.000.
The correct output should be:
Array ( [0] => In this year 2013 [1] => Hello Mr. Andre, your money is Rp 40.000 )
The title problem (Mr.) was already solved in my previous question. I've tried adding a regex for numbers, but it still doesn't work.
My code, which doesn't work:
$titles_number=array('(^[0-9]*)','(?<!Mr)', '(?<!Mrs)', '(?<!Ms)');
$sentences=preg_split('/('.implode('',$titles_number).')\./',$text);
print_r($sentences);
Can I do this in one go (one regex that handles both problems)? Tell me if it can't be done. Thanks in advance.
Upvotes: 0
Views: 217
Reputation: 75272
This will be easier to accomplish with preg_match_all():
preg_match_all(
'/[^\s.][^.]*(?:\.(?:(?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.)|(?=\d))[^.]*)*\./',
$subject, $result, PREG_PATTERN_ORDER);
print_r($result[0]);
explanation:
[^\s.] matches the next non-whitespace character (i.e., skip over any whitespace between sentences)
[^.]* gobbles up any non-dot characters
\. matches a dot IF...
(?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.) ...it's part of an honorific...
(?=\d) ...or part of a number
notes:
(?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.) is legal because the alternation is at the top level. That is, it acts like several discrete lookbehinds, each with a fixed length. That's why I had to repeat the \. in every branch instead of using (?<=(?:Prof|Dr|Mr|Mrs|Ms)\.).
\.(?=\d) seems to be sufficient for identifying a dot that's part of a number. If you really have to check for digits before and after the dot, you can use (?=(?<=\d\.)\d) instead.
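As a quick check, here is a minimal sketch that runs the pattern above against the sample text from the question. Each match keeps its sentence-final dot, so the sketch strips one trailing character from every match. That step is an assumption based on the output the question asks for, not part of the pattern itself.

```php
<?php
// Sample text taken from the question.
$subject = 'In this year 2013. Hello Mr. Andre, your money is Rp 40.000.';

// Pattern from the answer above: a dot is kept inside the sentence when it
// follows a listed honorific or is followed by a digit.
$pattern = '/[^\s.][^.]*(?:\.(?:(?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.)|(?=\d))[^.]*)*\./';

preg_match_all($pattern, $subject, $result, PREG_PATTERN_ORDER);

// Every match ends with the dot that closes the sentence; drop that one
// character (assumption: the sentences are wanted without it).
$sentences = array_map(
    function ($s) { return substr($s, 0, -1); },
    $result[0]
);

print_r($sentences);
```

On this input the two elements should be "In this year 2013" and "Hello Mr. Andre, your money is Rp 40.000". Text that does not end in a dot after the last sentence would not be matched at all, which is one more reason to treat this as a rough approach.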
If this is for anything more serious than a homework problem, you should discard regexes and look for a natural-language processing library. Crude as all this is, it's very close to the limit of what you can do with regexes.
Upvotes: 1
Reputation: 89649
You can avoid the number problem (and probably others) if you notice that each dot at the end of a sentence is followed by a space/tab/newline or by the end of the string:
$titles=array('(?<!Mr)', '(?<!Mrs)', '(?<!Ms)');
$sentences=preg_split('/('.implode('',$titles).')\.(?=\s|$)/',$text);
print_r($sentences);
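On the sample text from the question, the split above leaves a leading space on the second sentence and an empty string as the last element, because the lookahead does not consume the whitespace and the final dot is itself a delimiter. A minimal sketch of one way to tidy that up, assuming it is acceptable to swallow the whitespace after the dot and to drop empty pieces with PREG_SPLIT_NO_EMPTY (both choices are mine, not part of the answer above):

```php
<?php
// Sample text taken from the question.
$text = 'In this year 2013. Hello Mr. Andre, your money is Rp 40.000.';

$titles = array('(?<!Mr)', '(?<!Mrs)', '(?<!Ms)');

// Same idea as above, but the delimiter consumes the whitespace that
// follows the dot instead of only looking ahead at it.
$pattern = '/' . implode('', $titles) . '\.(?:\s+|$)/';

// PREG_SPLIT_NO_EMPTY drops the empty piece after the final dot.
$sentences = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($sentences);
```

The dot in "40.000" is left alone because it is followed by a digit rather than whitespace or the end of the string, and "Mr." is protected by the negative lookbehinds.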
Upvotes: 0