Dimitar Spasovski

Reputation: 2132

Splitting a string by white space and a period when not surrounded by quotes

I know that similar questions have been asked many times, but my regular expression knowledge is pretty bad and I can't get it to work for my case. So here is what I am trying to do:

I have a text and I want to separate the sentences. Each sentence ends with some white space and a period (there can be one or many spaces before the period, but there is always at least one).

At the beginning I used /\s+\./ and it worked great for separating the sentences, but then I noticed that there are cases such as this one: "some text . some text".
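To make the problem concrete, here is a small sketch of what the plain split does to such input. The sample string is made up for illustration.

```javascript
// Hypothetical sample: the second " ." sits inside a quoted section.
var sample = 'first . "quoted . text" second . third';

// Splitting on "whitespace followed by a period" ignores the quotes,
// so the quoted section gets cut in two.
var parts = sample.split(/\s+\./);

console.log(parts);
// parts[1] is ' "quoted' and parts[2] is ' text" second'
```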

Now, I don't want to separate the text in quotes. I searched and found a lot of solutions that work great for spaces (for example: /(".*?"|[^"\s]+)+(?=\s*|\s*$)/), but I was not able to modify them to separate by white space and a period.

Here is the code that I am using at the moment:

while (true) {    // stand-in for the larger loop this code lives in
    var regex = /\s+\./;
    var result = regex.exec(fullText);
    if (result == null) {
        break;    // no delimiter left
    }
    var length = result[0].length;       // size of the delimiter to remove
    var startingPoint = result.index;    // position where the delimiter starts
    var currentSentence = fullText.substring(0, startingPoint).trim();

    fullText = fullText.substring(startingPoint + length);
}

I separate the sentences one by one and remove each of them from the full text. The length variable holds the size of the portion that needs to be removed, and startingPoint is the position at which that portion starts. In my program this code is part of a larger while loop, reduced here to a bare while (true).

Upvotes: 4

Views: 1677

Answers (2)

Dmitry Egorov

Reputation: 9650

Instead of splitting, you could match the sentences that lie between the delimiters. That makes it easier to skip delimiters inside quotes. The corresponding regex is:

(.*?(?:".*?".*?)?|.*?)(?: \.|$)

Demo: https://regex101.com/r/iS9fN6/1

The sentences then may be retrieved in this loop:

while (match = regex.exec(input)) {
    console.log(match[1]); // each next sentence is in match[1]
}

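But there is a catch: with this particular expression, regex.exec(input) never returns null, so the loop runs forever. Once the end of the input is reached, the $ alternative lets the pattern match an empty string, and exec does not move lastIndex past a zero-length match. (Looks like a candidate for one more SO question.)

So the workaround I can suggest is to remove the $ from the expression. The regex then misses the last part of the input, which can be extracted afterwards as the trailer that the regex did not match: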

var input = "some text . some text . \"some text . some text\" some text . some text";
//var regex = /(.*?(?:".*?".*?)?|.*?)(?: \.|$)/g;
var regex = /(.*?(?:".*?".*?)?|.*?) \./g;
var match;    // declared explicitly so it is not an implicit global
var trailerPos = 0;
while (match = regex.exec(input)) {
    console.log(match[1]);    // each next sentence is in match[1]
    trailerPos = match.index + match[0].length;
}
if (trailerPos < input.length) {
    console.log(input.substring(trailerPos));    // the last sentence in
                                                 // input.substring(trailerPos)
}
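Another way around the infinite loop, if you would rather keep the $ alternative, is to stop as soon as exec reports a zero-length match. This is only a minimal sketch: the helper name matchSentences is my own, and it relies on the fact that without the m flag the $ can only produce an empty match at the very end of the input.

```javascript
// Hypothetical helper; keeps the "$" alternative from the original regex.
function matchSentences(input) {
    var regex = /(.*?(?:".*?".*?)?|.*?)(?: \.|$)/g;
    var sentences = [];
    var match;
    while ((match = regex.exec(input)) !== null) {
        if (match[0].length === 0) {
            // Zero-length match at the end of the input: exec would keep
            // returning it because lastIndex does not advance, so stop.
            break;
        }
        sentences.push(match[1]);    // the sentence without its delimiter
    }
    return sentences;
}

var input = "some text . some text . \"some text . some text\" some text . some text";
console.log(matchSentences(input));
```

With this variant the last sentence arrives through the regular loop, so no separate trailer handling is needed.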

Update:

If sentences span multiple lines, the regex won't work, since the . pattern does not match newline characters. In that case, use [\s\S] instead of .:

var input = "some \ntext . some text . \"some\n text . some text\" some text . so\nm\ne text";
var regex = /([\s\S]*?(?:"[\s\S]*?"[\s\S]*?)?|[\s\S]*?) \./g;
var match;    // declared explicitly so it is not an implicit global
var trailerPos = 0;
var sentences = [];
while (match = regex.exec(input)) {
    sentences.push(match[1]);
    trailerPos = match.index + match[0].length;
}
if (trailerPos < input.length) {
    sentences.push(input.substring(trailerPos));
}
sentences.forEach(function(s) {
    console.log("Sentence: -->%s<--", s);
});
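As far as I can tell, the patterns above expect exactly one literal space before the period and cope with only one quoted section per sentence. The following sketch relaxes both points. It is a different pattern from the ones above, not a drop-in fix: it treats every "..." section as a single unit, accepts any run of whitespace before the period, and trims the results. The helper name splitSentences is hypothetical, and the sketch assumes the quotes in the input are balanced. Text before a stray quote would be dropped.

```javascript
// Hypothetical helper, not part of the code above.
function splitSentences(text) {
    // A sentence is a lazy run of quoted sections ("...") and non-quote
    // characters, ended by whitespace plus a period or by the end of input.
    var regex = /((?:"[^"]*"|[^"])*?)(?:\s+\.|$)/g;
    var sentences = [];
    var match;
    while ((match = regex.exec(text)) !== null) {
        if (match[0].length === 0) {
            // Zero-length match: step forward so exec cannot loop forever.
            regex.lastIndex++;
            continue;
        }
        var sentence = match[1].trim();
        if (sentence.length > 0) {
            sentences.push(sentence);
        }
    }
    return sentences;
}

// Made-up input: tab and newline before periods, two quoted sections.
var input = "one \t. \"a . b\" two \"c . d\" \n. thr\nee";
splitSentences(input).forEach(function (s) {
    console.log("Sentence: -->%s<--", s);
});
```

Because [^"] also matches newlines, no separate multi-line variant is needed here.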

Upvotes: 2

Rohit

Reputation: 93

Use JavaScript's encode and decode when sending and receiving.

Upvotes: 0
