andrefadila

Reputation: 687

RegEx to exclude number using PHP

This question is a continuation of my previous question:

RegEx to exclude academic title

I want to split a paragraph string into an array of sentences using a regular expression, with the dot character (.) as the delimiter. My next problem is with numbers.

Here is an example:

In this year 2013. Hello Mr. Andre, your money is Rp 40.000.

The correct output would be:

Array ( [0] => In this year 2013 [1] => Hello Mr. Andre, your money is Rp 40.000 )

The title problem (Mr.) was already solved in my previous question. I tried adding a regex for numbers, but it still doesn't work.

My non-working code:

$titles_number=array('(^[0-9]*)','(?<!Mr)', '(?<!Mrs)', '(?<!Ms)');
$sentences=preg_split('/('.implode('',$titles_number).')\./',$text);
print_r($sentences);

Can I do this in one go (a single regex that handles both problems)? Tell me if it isn't possible. Thanks in advance.

Upvotes: 0

Views: 217

Answers (2)

Alan Moore

Reputation: 75272

This will be easier to accomplish with preg_match_all():

preg_match_all(
    '/[^\s.][^.]*(?:\.(?:(?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.)|(?=\d))[^.]*)*\./',
    $subject, $result, PREG_PATTERN_ORDER);
print_r($result[0]);

explanation:

  • [^\s.] matches the next non-whitespace character (i.e., skip over any whitespace between sentences)
  • [^.]* gobbles up any non-dot characters
  • \. matches a dot IF...
  • (?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.) ...it's part of an honorific...
  • (?=\d) ...or part of a number
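
To see the pieces working together, here is a minimal sketch that runs the pattern on the sample text from the question. The rtrim() step is my own addition, assuming you want the sentences without their closing dot, as in the expected output:

```php
<?php
// Sample text taken from the question.
$subject = 'In this year 2013. Hello Mr. Andre, your money is Rp 40.000.';

// Same pattern as in the answer: a dot is skipped over when it closes
// an honorific or is followed by a digit.
$pattern = '/[^\s.][^.]*(?:\.(?:(?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.)|(?=\d))[^.]*)*\./';

preg_match_all($pattern, $subject, $result, PREG_PATTERN_ORDER);

// Assumption: strip the sentence-ending dot that each match still carries.
$sentences = array_map(function ($s) {
    return rtrim($s, '.');
}, $result[0]);

print_r($sentences);
```

For this input the two entries should be "In this year 2013" and "Hello Mr. Andre, your money is Rp 40.000".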

notes:

  1. (?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.) is legal because the alternation is at the top level. That is, it acts like several discrete lookbehinds, each with a fixed length. That's why I had to repeat the \. in every branch instead of using (?<=(?:Prof|Dr|Mr|Mrs|Ms)\.).

  2. \.(?=\d) seems to be sufficient for identifying a dot that's part of a number. If you really have to check for digits before and after the dot, you can use (?=(?<=\d\.)\d) instead.

  3. If this is for anything more serious than a homework problem, you should discard regexes and look for a natural-language processing library. Crude as all this is, it's very close to the limit of what you can do with regexes.
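
To illustrate note 2, here is a rough sketch comparing the two number tests. The helper match_sentences() and the sample text are made up for the comparison and are not part of the question:

```php
<?php
// Hypothetical helper: plugs a "dot belongs to a number" test into the pattern.
function match_sentences($numberDot, $text)
{
    $pattern = '/[^\s.][^.]*(?:\.(?:(?<=Prof\.|Dr\.|Mr\.|Mrs\.|Ms\.)|'
        . $numberDot . ')[^.]*)*\./';
    preg_match_all($pattern, $text, $result, PREG_PATTERN_ORDER);
    return $result[0];
}

// Made-up sample: "v.2" has a digit only after the dot, "1.5" on both sides.
$text = 'Version v.2 costs 1.5 dollars.';

$loose  = match_sentences('(?=\d)', $text);           // digit after the dot
$strict = match_sentences('(?=(?<=\d\.)\d)', $text);  // digits on both sides

print_r($loose);
print_r($strict);
```

With the looser test the whole text stays one sentence. The stricter test no longer treats the dot in "v.2" as part of a number, so the text is cut after "Version v." instead. Which behavior you want depends on your data.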

Upvotes: 1

Casimir et Hippolyte

Reputation: 89649

You can avoid the number problem (and probably others) if you notice that each dot at the end of a sentence is followed by a space/tab/newline or by the end of the string:

$titles=array('(?<!Mr)', '(?<!Mrs)', '(?<!Ms)');
$sentences=preg_split('/('.implode('',$titles).')\.(?=\s|$)/',$text);
print_r($sentences);
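
Note that the split above leaves the space in front of the second sentence and an empty string after the final dot. A possible variant, offered only as a sketch: consuming the whitespace and passing PREG_SPLIT_NO_EMPTY are my assumptions about what you want, not something the question requires.

```php
<?php
// Sample text taken from the question.
$text = 'In this year 2013. Hello Mr. Andre, your money is Rp 40.000.';

$titles = array('(?<!Mr)', '(?<!Mrs)', '(?<!Ms)');

// Assumption: swallow the whitespace after the dot instead of only looking
// ahead at it, so the next sentence does not start with a space.
$pattern = '/' . implode('', $titles) . '\.(?:\s+|$)/';

// PREG_SPLIT_NO_EMPTY drops the empty piece left after the last dot.
$sentences = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($sentences);
```

For the sample text this should produce exactly the two sentences from the question, with no leading space and no empty trailing element.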

Upvotes: 0
