Reputation: 1592
This regex:
var text = "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."
// break the string up into sentences based on punctuation and quotation marks
var tokens = text.match(/(?<=\s+|^)[\"\'\‘\“\'\"\[\(\{\⟨](.*?[.?!])(\s[.?!])*[\"\'\’\”\'\"\]\)\}\⟩](?=\s+|$)|(?<=\s+|^)\S(.*?[.?!])(\s[.?!])*(?=\s+|$)/g);
breaks on iOS Safari due to unsupported lookbehind assertions ((?<= ) and (?<! )). Is there an equivalent (or similar) regex for sentence tokenization that I can use? Preferably it should not break due to other iOS Safari compatibility issues, as referenced here: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp#assertions
Upvotes: 1
Views: 2791
Reputation: 2129
The issue was the ?<= lookbehind. If you can replace it with something else (in my case, ?!), it might work.
Upvotes: -1
Reputation: 785856
Here is a version of your regex that breaks the input into sentences without using any lookbehind assertions:
/(?:\s|^)(?:["'‘“'"\[({⟨].*?[.?!](?:\s[.?!])*["'’”'"\])}⟩]|\S.*?[.?!](?:\s[.?!])*)(?=\s|$)/gm
Please keep in mind that your regex may break on sentences containing words that end with a dot, such as Jr., Sr., Mr., etc., and in a few more cases like that.
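A minimal usage sketch, assuming the pattern above is used with String.prototype.match. Because the lookbehind is replaced by a consuming group, (?:\s|^), each match can start with the whitespace that preceded the sentence, so the sketch trims every match. The helper name splitSentences and the sample strings are illustrative only and do not come from the question.

```javascript
// Lookbehind-free pattern from the answer above, copied unchanged.
const SENTENCE_RE = /(?:\s|^)(?:["'‘“'"\[({⟨].*?[.?!](?:\s[.?!])*["'’”'"\])}⟩]|\S.*?[.?!](?:\s[.?!])*)(?=\s|$)/gm;

// Hypothetical helper: the name is an assumption for this sketch.
function splitSentences(text) {
  // match() returns null when nothing matches, so fall back to an empty array.
  const matches = text.match(SENTENCE_RE) || [];
  // (?:\s|^) consumes the separating whitespace, so strip it from each match.
  return matches.map((sentence) => sentence.trim());
}

console.log(splitSentences("Hello world. How are you? Fine!"));
// [ 'Hello world.', 'How are you?', 'Fine!' ]

console.log(splitSentences("Mr. Smith left."));
// [ 'Mr.', 'Smith left.' ]  (the abbreviation is split off, as warned above)
```

The second call shows the abbreviation caveat in practice: "Mr." is followed by whitespace, so the pattern treats it as the end of a sentence.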
Upvotes: 0