Reputation: 7094
I have random text stored in $sentences. Using regex, I want to split the text into sentences, see:
function splitSentences($text) {
    $re = '/ # Split sentences on whitespace between them.
        (?<= # Begin positive lookbehind.
            [.!?] # Either an end of sentence punct,
            | [.!?][\'"] # or end of sentence punct and quote.
        ) # End positive lookbehind.
        (?<! # Begin negative lookbehind.
            Mr\. # Skip either "Mr."
            | Mrs\. # or "Mrs.",
            | T\.V\.A\. # or "T.V.A.",
            # or... (you get the idea).
        ) # End negative lookbehind.
        \s+ # Split on whitespace between sentences.
        /ix';
    $sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
    return $sentences;
}

$sentences = splitSentences($sentences);
print_r($sentences);
It works fine.
However, it doesn't split the text into sentences if there are Unicode characters:
$sentences = 'Entertainment media properties. Fairy Tail and Tokyo Ghoul.';
Or this scenario:
$sentences = "Entertainment media properties. Fairy Tail and Tokyo Ghoul.";
What can I do to make it work when Unicode characters exist in the text?
Here is an ideone for testing.
I am looking for a complete solution to this. Before posting an answer, please read the comment thread I had with WiktorStribiżew for more relevant info on this issue.
Upvotes: 24
Views: 3631
Reputation: 21
I know this question is old and has been nicely answered by @ndnenkov, but I figured I could clean up the PHP and make it more efficient, since it was really slow on large bodies of text.
Here are my updates:
function sentence_split($text) {
// put regex tests into an easier to read array
$regexes = array(
[
"is_sentence_boundary"=>false,
"before"=>'/(?:(?:[\'\"„][\.!?…][\'\"”]\s)|(?:[^\.]\s[A-Z]\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s[A-Z]\.\s)|(?:\bApr\.\s)|(?:\bAug\.\s)|(?:\bBros\.\s)|(?:\bCo\.\s)|(?:\bCorp\.\s)|(?:\bDec\.\s)|(?:\bDist\.\s)|(?:\bFeb\.\s)|(?:\bInc\.\s)|(?:\bJan\.\s)|(?:\bJul\.\s)|(?:\bJun\.\s)|(?:\bMar\.\s)|(?:\bNov\.\s)|(?:\bOct\.\s)|(?:\bPh\.?D\.\s)|(?:\bSept?\.\s)|(?:\b\p{Lu}\.\p{Lu}\.\s)|(?:\b\p{Lu}\.\s\p{Lu}\.\s)|(?:\bcf\.\s)|(?:\be\.g\.\s)|(?:\besp\.\s)|(?:\bet\b\s\bal\.\s)|(?:\bvs\.\s)|(?:\p{Ps}[!?]+\p{Pe} ))\Z/su',
"after"=>'/\A(?:)/su'
],
[
"is_sentence_boundary"=>false,
"before"=>'/(?:(?:[\.\s]\p{L}{1,2}\.\s))\Z/su',
"after"=>'/\A(?:[\p{N}\p{Ll}])/su'
],
[
"is_sentence_boundary"=>false,
"before"=>'/(?:(?:[\[\(]*\.\.\.[\]\)]* ))\Z/su',
"after"=>'/\A(?:[^\p{Lu}])/su'
],
[
"is_sentence_boundary"=>false,
"before"=>'/(?:(?:\b(?:pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs)\.\s))\Z/su',
"after"=>'/\A(?:[^\p{Lu}]|I)/su'
],
[
"is_sentence_boundary"=>false,
"before"=>'/(?:(?:\b[Ee]tc\.\s))\Z/su',
"after"=>'/\A(?:[^\p{Lu}])/su'
],
[
"is_sentence_boundary"=>false,
"before"=>'/(?:(?:[\.!?…]+\p{Pe} )|(?:[\[\(]*…[\]\)]* ))\Z/su',
"after"=>'/\A(?:\p{Ll})/su'
],
[
"is_sentence_boundary"=>false,
"before"=>'/(?:(?:\b\p{L}\.))\Z/su',
"after"=>'/\A(?:\p{L}\.)/su'
],
[
"is_sentence_boundary"=>false,
"before"=>'/(?:(?:\b\p{L}\.\s))\Z/su',
"after"=>'/\A(?:\p{L}\.\s)/su'
],
[
"is_sentence_boundary"=>false,
"before"=>'/(?:(?:\b[Ff]igs?\.\s)|(?:\b[nN]o\.\s))\Z/su',
"after"=>'/\A(?:\p{N})/su'
],
[
"is_sentence_boundary"=>false,
"before"=>'/(?:(?:[\"”\']\s*))\Z/su',
"after"=>'/\A(?:\s*\p{Ll})/su'
],
[
"is_sentence_boundary"=>true,
"before"=>'/(?:(?:[\.!?…][\x{00BB}\x{2019}\x{201D}\x{203A}\"\'\p{Pe}\x{0002}]*\s)|(?:\r?\n))\Z/su',
"after"=>'/\A(?:)/su'
],
[
"is_sentence_boundary"=>true,
"before"=>'/(?:(?:[\.!?…][\'\"\x{00BB}\x{2019}\x{201D}\x{203A}\p{Pe}\x{0002}]*))\Z/su',
"after"=>'/\A(?:\p{Lu}[^\p{Lu}])/su'
],
[
"is_sentence_boundary"=>true,
"before"=>'/(?:(?:\s\p{L}[\.!?…]\s))\Z/su',
"after"=>'/\A(?:\p{Lu}\p{Ll})/su'
]
);
$sentences = array();
$sentence = '';
$before = '';
$testLen = 10; // Used to set before/after chunk sizes. 10 seems to be the smallest that works the best.
$after = substr($text, 0, $testLen); // start with the first set of chars.
while($text != '') {
// run regex tests
foreach($regexes as $reg) {
if(preg_match($reg["before"], $before) && preg_match($reg["after"], $after)) {
// if this passes a sentence ending test then add to the array
if($reg["is_sentence_boundary"]) {
$sentences[] = $sentence;
$sentence = '';
}
break;
}
}
// add the char to the sentence
$sentence .= $after[0];
// eat at text until empty to end loop
$text = substr($text, 1);
// add a char behind the before var and then remove the first char
$before = substr($before.$after[0], -$testLen);
// create a new after with the first chars from the text
$after = substr($text, 0, $testLen);
}
if($sentence != '') {
$sentences[] = $sentence . $after;
}
return $sentences;
}
$text = "Mr. Entertainment media properties. Fairy Tail 3.5 and Tokyo Ghoul.";
print_r(sentence_split($text));
Upvotes: 2
Reputation: 36110
As should be expected, any sort of natural language processing is not a trivial task. The reason is that natural languages are evolutionary systems. No single person sat down and decided which ideas are good and which are not. Every rule has 20-40% exceptions. With that said, the complexity of a single regex that could do your bidding would be off the charts. Still, the following solution relies mainly on regexes.
As for where these regexes came from: I translated this Ruby library, which is generated based on this paper. If you truly want to understand them, there is no alternative but to read the paper.
As far as accuracy goes, I encourage you to test it with different texts. After some experimentation, I was very pleasantly surprised.
In terms of performance, the regexes should be highly performant: all of them have either a \A or a \Z anchor, there are almost no repetition quantifiers, and in the places where there are, there can't be any backtracking. Still, regexes are regexes. You will have to do some benchmarking if you plan to use this in tight loops on huge chunks of text.
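For the benchmarking mentioned above, a minimal timing sketch could look like the following. The helper name `benchmarkSplitter` and the stand-in `$naive` splitter are made up for illustration; swap in whichever splitting function you actually want to measure. It assumes PHP 7.3+ for `hrtime()`.

```php
<?php
// Hypothetical helper: average wall-clock time of one call, in milliseconds.
function benchmarkSplitter(callable $splitter, string $text, int $runs = 5): float
{
    $start = hrtime(true);               // monotonic clock, in nanoseconds
    for ($i = 0; $i < $runs; $i++) {
        $splitter($text);
    }
    return (hrtime(true) - $start) / $runs / 1e6;
}

// Stand-in splitter, for illustration only: splits on whitespace after . ! ?
$naive = function (string $text): array {
    return preg_split('/(?<=[.!?])\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
};

// A large synthetic input built from a repeated sentence.
$sample = str_repeat("Fairy Tail and Tokyo Ghoul. ", 10000);
printf("%.3f ms per run\n", benchmarkSplitter($naive, $sample));
```

Timings from a sketch like this vary a lot between machines and PHP versions, so compare candidates on the same input rather than reading the numbers as absolute.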
Mandatory disclaimer: excuse my rusty PHP skills. The following code might not be the most idiomatic PHP ever, but it should still be clear enough to get the point across.
function sentence_split($text) {
$before_regexes = array('/(?:(?:[\'\"„][\.!?…][\'\"”]\s)|(?:[^\.]\s[A-Z]\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s[A-Z]\.\s)|(?:\bApr\.\s)|(?:\bAug\.\s)|(?:\bBros\.\s)|(?:\bCo\.\s)|(?:\bCorp\.\s)|(?:\bDec\.\s)|(?:\bDist\.\s)|(?:\bFeb\.\s)|(?:\bInc\.\s)|(?:\bJan\.\s)|(?:\bJul\.\s)|(?:\bJun\.\s)|(?:\bMar\.\s)|(?:\bNov\.\s)|(?:\bOct\.\s)|(?:\bPh\.?D\.\s)|(?:\bSept?\.\s)|(?:\b\p{Lu}\.\p{Lu}\.\s)|(?:\b\p{Lu}\.\s\p{Lu}\.\s)|(?:\bcf\.\s)|(?:\be\.g\.\s)|(?:\besp\.\s)|(?:\bet\b\s\bal\.\s)|(?:\bvs\.\s)|(?:\p{Ps}[!?]+\p{Pe} ))\Z/su',
'/(?:(?:[\.\s]\p{L}{1,2}\.\s))\Z/su',
'/(?:(?:[\[\(]*\.\.\.[\]\)]* ))\Z/su',
'/(?:(?:\b(?:pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs)\.\s))\Z/su',
'/(?:(?:\b[Ee]tc\.\s))\Z/su',
'/(?:(?:[\.!?…]+\p{Pe} )|(?:[\[\(]*…[\]\)]* ))\Z/su',
'/(?:(?:\b\p{L}\.))\Z/su',
'/(?:(?:\b\p{L}\.\s))\Z/su',
'/(?:(?:\b[Ff]igs?\.\s)|(?:\b[nN]o\.\s))\Z/su',
'/(?:(?:[\"”\']\s*))\Z/su',
'/(?:(?:[\.!?…][\x{00BB}\x{2019}\x{201D}\x{203A}\"\'\p{Pe}\x{0002}]*\s)|(?:\r?\n))\Z/su',
'/(?:(?:[\.!?…][\'\"\x{00BB}\x{2019}\x{201D}\x{203A}\p{Pe}\x{0002}]*))\Z/su',
'/(?:(?:\s\p{L}[\.!?…]\s))\Z/su');
$after_regexes = array('/\A(?:)/su',
'/\A(?:[\p{N}\p{Ll}])/su',
'/\A(?:[^\p{Lu}])/su',
'/\A(?:[^\p{Lu}]|I)/su',
'/\A(?:[^\p{Lu}])/su',
'/\A(?:\p{Ll})/su',
'/\A(?:\p{L}\.)/su',
'/\A(?:\p{L}\.\s)/su',
'/\A(?:\p{N})/su',
'/\A(?:\s*\p{Ll})/su',
'/\A(?:)/su',
'/\A(?:\p{Lu}[^\p{Lu}])/su',
'/\A(?:\p{Lu}\p{Ll})/su');
$is_sentence_boundary = array(false, false, false, false, false, false, false, false, false, false, true, true, true);
$count = 13;
$sentences = array();
$sentence = '';
$before = '';
$after = substr($text, 0, 10);
$text = substr($text, 10);
while($text != '') {
for($i = 0; $i < $count; $i++) {
if(preg_match($before_regexes[$i], $before) && preg_match($after_regexes[$i], $after)) {
if($is_sentence_boundary[$i]) {
array_push($sentences, $sentence);
$sentence = '';
}
break;
}
}
$first_from_text = $text[0];
$text = substr($text, 1);
$first_from_after = $after[0];
$after = substr($after, 1);
$before .= $first_from_after;
$sentence .= $first_from_after;
$after .= $first_from_text;
}
if($sentence != '' && $after != '') {
array_push($sentences, $sentence.$after);
}
return $sentences;
}
$text = "Mr. Entertainment media properties. Fairy Tail 3.5 and Tokyo Ghoul.";
print_r(sentence_split($text));
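One caveat when running this on non-ASCII text: `substr()` and `$string[0]` work on bytes, while the patterns use the `u` modifier and expect valid UTF-8. The sketch below (not part of the original answer) shows what happens when a byte-based window cuts a multi-byte character in half, and one possible way to get code-point based windows using only `preg_split`. Whether this matters for your input is an assumption to check against your own data.

```php
<?php
// "Zoë. Ünder." written with escapes so the file encoding does not matter.
$text = "Zo\u{00EB}. \u{00DC}nder.";

// Byte-based window: 3 bytes cut the two-byte "ë" in half.
$window = substr($text, 0, 3);
var_dump(preg_match('/\A(?:)/su', $window));          // bool(false): invalid UTF-8
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);  // bool(true)

// Code-point based alternative: split into characters first, then slice.
$chars  = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
$window = implode('', array_slice($chars, 0, 3));     // "Zoë"
var_dump(preg_match('/\A(?:)/su', $window));          // int(1)
```

Because `preg_match` returns `false` rather than `0` on invalid UTF-8, a rule tested against a broken window silently fails to match instead of raising an error.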
Upvotes: 13
Reputation: 3543
Henrik Petterson, please read this completely, because I need to repeat a few things that were already said above.
As many people above have mentioned, adding the /u modifier makes the pattern work on Unicode characters. That is true, and it works perfectly in the example below:
<?php
function splitSentences($text) {
$re = '/# Split sentences on whitespace between them.
(?<= # Begin positive lookbehind.
[.!?] # Either an end of sentence punct,
| [.!?][\'"] # or end of sentence punct and quote.
) # End positive lookbehind.
(?<! # Begin negative lookbehind.
Mr\. # Skip either "Mr."
| Mrs\. # or "Mrs.",
| Ms\. # or "Ms.",
| Jr\. # or "Jr.",
| Dr\. # or "Dr.",
| Prof\. # or "Prof.",
| Vol\. # or "Vol.",
| A\.D\. # or "A.D.",
| B\.C\. # or "B.C.",
| Sr\. # or "Sr.",
| T\.V\.A\. # or "T.V.A.",
# or... (you get the idea).
) # End negative lookbehind.
\s+ # Split on whitespace between sentences.
/uix';
$sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
return $sentences;
}
$sentences = 'Entertainment media properties. Ã Fairy Tail and Tokyo Ghoul. Entertainment media properties. Â Fairy Tail and Tokyo Ghoul.';
$sentences = splitSentences($sentences);
print_r($sentences);
The examples you gave in the comments did not work because they have no whitespace character between the two sentences, and your pattern specifically requires whitespace between sentences:
\s+ # Split on whitespace between sentences.
The example from your comments above fails simply because there is no space before Â.
Upvotes: 3
Reputation: 31273
There is a quite complex Unicode Text Segmentation algorithm that deals with various text boundaries, including sentence boundaries.
http://unicode.org/reports/tr29/
The best-known implementation of this algorithm is ICU.
I have found this class: http://php.net/manual/en/class.intlbreakiterator.php However, it seems to be available only in git, not in a mainstream release.
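As a rough sketch of what using that class for sentence boundaries might look like, assuming the intl extension is loaded (the PHP manual lists `IntlBreakIterator` as available from PHP 5.5): the function name `icuSplitSentences` and the `en_US` default locale are choices made here for illustration, not something taken from the manual page.

```php
<?php
// Hypothetical wrapper around ICU's sentence break iterator.
function icuSplitSentences(string $text, string $locale = 'en_US'): array
{
    $iterator = IntlBreakIterator::createSentenceInstance($locale);
    $iterator->setText($text);

    $sentences = [];
    // getPartsIterator() yields the substrings between consecutive boundaries.
    foreach ($iterator->getPartsIterator() as $part) {
        $part = trim($part);
        if ($part !== '') {
            $sentences[] = $part;
        }
    }
    return $sentences;
}

// Guarded so the script does not die where intl is missing.
if (extension_loaded('intl')) {
    print_r(icuSplitSentences("Entertainment media properties. Fairy Tail and Tokyo Ghoul."));
}
```

Note that the default UAX #29 rules do not carry an abbreviation list, so text such as "Mr. Smith" is likely to be split after "Mr."; treat the output as a starting point rather than a finished splitter.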
So if you want to solve this VERY complex problem in the best way, I'd suggest to:
Upvotes: 1
Reputation: 627083
I believe that it is impossible to get a bullet-proof sentence splitter, considering that user-generated content is not always grammatically and syntactically correct. Moreover, reaching 100% correct results is just impossible because scraping and content-extraction tools are imperfect and may fail to deliver clean content, leaving whitespace or punctuation rubbish in the text. And finally, business is now more biased towards a good-enough strategy: if you manage to split the text correctly 95% of the time, it is in most cases considered a success.
Now, any sentence splitting task is an NLP task, and just one, or two, or three regexps are not enough. Rather than devising your own regex chain, I'd advise using one of the existing NLP libraries for that.
The following is a rough list of the rules used to split sentences.
- Each linebreak separates sentences.
- The end of the text indicates the end of a sentence if not otherwise ended through proper punctuation.
- Sentences must be at least two words long, unless a linebreak or end-of-text.
- An empty line is not a sentence.
- Each question- or exclamation mark or combination thereof, is considered the end of a sentence.
- A single period is considered the end of a sentence, unless...
- It is preceded by one word, or...
- It is followed by one word.
- A sequence of multiple periods is not considered the end of a sentence.
Usage example:
<?php
require_once 'classes/autoloader.php'; // Include the autoloader.
$text = "Hello there, Mr. Smith. What're you doing today... Smith,"
. " my friend?\n\nI hope it's good. This last sentence will"
. " cost you $2.50! Just kidding :)"; // This is the test text we're going to use
$Sentence = new Sentence; // Create a new instance
$sentences = $Sentence->split($text); // Split into array of sentences
$count = $Sentence->count($text); // Count the number of sentences
?>
Sample code:
<?php
include ('vendor/autoload.php');
use \NlpTools\Tokenizers\ClassifierBasedTokenizer;
use \NlpTools\Tokenizers\WhitespaceTokenizer;
use \NlpTools\Classifiers\ClassifierInterface;
use \NlpTools\Documents\DocumentInterface;
class EndOfSentence implements ClassifierInterface
{
public function classify(array $classes, DocumentInterface $d) {
list($token,$before,$after) = $d->getDocumentData();
$dotcnt = count(explode('.',$token))-1;
$lastdot = substr($token,-1)=='.';
if (!$lastdot) // assume that all sentences end in full stops
return 'O';
if ($dotcnt>1) // to catch some naive abbreviations U.S.A.
return 'O';
return 'EOW';
}
}
$tok = new ClassifierBasedTokenizer(
new EndOfSentence(),
new WhitespaceTokenizer()
);
$text = "We are what we repeatedly do.
Excellence, then, is not an act, but a habit.";
print_r($tok->tokenize($text));
// Array
// (
// [0] => We are what we repeatedly do.
// [1] => Excellence, then, is not an act, but a habit.
// )
IMPORTANT NOTE: Most NLP tokenization models I tested do not handle glued sentences well. However, if you add a space after each punctuation chain, sentence splitting quality improves. Just add this before sending the text to the sentence splitting function:
$txt = preg_replace('~\p{P}+~u', "$0 ", $txt);
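To make the effect of that preprocessing step concrete, here is a small sketch. The wrapper name `unglue` is invented for illustration, and the `u` modifier is included so that `\p{P}` is matched against code points rather than raw bytes.

```php
<?php
// Hypothetical wrapper: append a space after every run of punctuation.
function unglue(string $txt): string
{
    // preg_replace returns null on error (e.g. invalid UTF-8); fall back to the input.
    return preg_replace('~\p{P}+~u', '$0 ', $txt) ?? $txt;
}

echo unglue("Fairy Tail.Tokyo Ghoul!"), "\n";   // Fairy Tail. Tokyo Ghoul! (plus a trailing space)
echo unglue("It costs 3.5 dollars"), "\n";      // It costs 3. 5 dollars
```

The second call shows the trade-off: decimal numbers and similar tokens get a space inserted as well, so this step helps with glued sentences at the cost of disturbing other punctuation.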
Upvotes: 2
Reputation: 16573
If spaces are unreliable, then you could match on a . followed by any number of spaces, followed by a capital letter.
You can match any capital letter in a UTF-8 string using the Unicode character property \p{Lu}.
You only need to exclude abbreviations that tend to be followed by proper names (person names, company names, etc.), since those start with a capital letter.
function splitSentences($text) {
    $re = '/ # Split sentences ending with a dot
        .+? # Match everything before, until we find
        (
            $ | # the end of the string, or
            \. # a dot
            (?<! # Begin negative lookbehind.
                Mr\. # Skip either "Mr."
                | Mrs\. # or "Mrs.",
                # or... (you get the idea).
            ) # End negative lookbehind.
            "? # Optionally match a quote
            \s* # Any number of whitespaces
            (?= # Begin positive lookahead
                \p{Lu} | # an upper case letter, or
                " # a quote
            )
        )
        /iux';
    if (!preg_match_all($re, $text, $matches, PREG_PATTERN_ORDER)) {
        return [];
    }
    $sentences = array_map('trim', $matches[0]);
    return $sentences;
}

$text = "Mr. Entertainment media properties. Fairy Tail 3.5 and Tokyo Ghoul.";
$sentences = splitSentences($text);
print_r($sentences);
Note: This answer might not be accurate enough for your situation. I'm unable to judge that. It does address the problem as described above and is easily understandable.
Upvotes: 3
Reputation: 536567
Â is what it looks like when you print the UTF-8 character U+00A0 Non-Breaking Space to a page/console that is being interpreted as Latin-1. So I think you have a non-breaking space between the sentences, not a normal space.
\s can match a non-breaking space too, but you will need to use the /u modifier to tell preg you are sending it a UTF-8-encoded string. Otherwise it, like your print command, will guess Latin-1 and see it as two characters: Â followed by a Latin-1 non-breaking space.
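A short sketch of the point above, assuming PHP 7+ for the `\u{...}` string escape. The lookbehind pattern is a simplified stand-in for the one in the question, without the abbreviation list.

```php
<?php
$nbsp = "\u{00A0}";           // U+00A0 NO-BREAK SPACE
echo bin2hex($nbsp), "\n";    // c2a0: two bytes in UTF-8

// With /u the subject is treated as UTF-8 and \s matches the non-breaking space.
var_dump(preg_match('/\s/u', $nbsp));   // int(1)

// Two sentences separated only by a non-breaking space.
$text  = "Entertainment media properties." . $nbsp . "Fairy Tail and Tokyo Ghoul.";
$parts = preg_split('/(?<=[.!?])\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
print_r($parts);
// Array
// (
//     [0] => Entertainment media properties.
//     [1] => Fairy Tail and Tokyo Ghoul.
// )
```

Without the `u` modifier the same pattern sees the bytes C2 and A0 individually, and whether `\s` matches either of them depends on the character tables in use, so the split is not reliable.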
Upvotes: 6