Kikkomann

Reputation: 416

Split a string on several separators while keeping one or more of the separators

Is there a way to split a string on several separators while keeping some of the separators in the resulting array? So if I have the string "This is a-weird string,right?", I would like to get

["This", "is", "a", "-", "weird", "string", ",", "right", "?"]

I have tried using string.split(/([^a-zA-Z])/g), but I don't want to keep the whitespace. This guide seems like something I could use, but my understanding of regex is not good enough to know how to combine the two.

Upvotes: 3

Views: 553

Answers (3)

sbgib

Reputation: 5828

Try it like this:

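const str = "This is a-weird string,right?";

// Pad each kept delimiter (, - ?) with a space on both sides, then split on spaces.
// "?" is included so that the trailing "?" becomes its own element, as the question asks.
const arr = str.replace(/(\S)([,\-?])/g, "$1 $2").replace(/([,\-?])(\S)/g, "$1 $2").split(" ");

console.log(arr);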

Replace each delimiter you want to keep so that it has a space on each side, then split on the spaces to get the array.
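The same padding idea can be wrapped in a small helper so that the kept delimiters are passed in as a string. This is only a sketch: splitKeeping and its delimiters parameter are names made up here, not part of the answer, and it assumes whitespace is the only separator that should be discarded.

```javascript
// Hypothetical helper (not from the answer): pads every kept delimiter
// with spaces, then splits on runs of whitespace.
function splitKeeping(str, delimiters) {
  // Escape the characters that are special inside a character class: \ ] ^ -
  const cls = delimiters.replace(/[\\\]\^\-]/g, "\\$&");
  // "$&" in the replacement stands for the matched delimiter itself.
  const padded = str.replace(new RegExp("[" + cls + "]", "g"), " $& ");
  // trim() plus split(/\s+/) avoids empty entries from doubled spaces.
  return padded.trim().split(/\s+/);
}

const kept = splitKeeping("This is a-weird string,right?", ",-?");
console.log(kept);
```

Splitting on /\s+/ instead of a single space also means that two delimiters in a row do not leave empty strings in the result.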

Upvotes: 1

Wiktor Stribiżew

Reputation: 626691

You can use

console.log("This is a-weird string,right?".match(/[^\W_]+|[^\w\s]|_/g))

The regex matches:

  • [^\W_]+ - one or more alphanumeric chars
  • | - or
  • [^\w\s] - any char other than a word or whitespace char
  • | - or
  • _ - an underscore.

See the regex demo.
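One detail worth guarding against when using this in real code: match() with the g flag returns null rather than an empty array when nothing matches. A minimal sketch, where tokenize is a hypothetical wrapper name and not part of the answer:

```javascript
// Hypothetical wrapper around the ASCII pattern from the answer.
function tokenize(str) {
  // match() with the g flag returns null when there is no match,
  // so fall back to an empty array.
  return str.match(/[^\W_]+|[^\w\s]|_/g) || [];
}

console.log(tokenize("This is a-weird string,right?"));
console.log(tokenize("snake_case")); // the underscore becomes its own token
console.log(tokenize("   "));        // whitespace only: empty array
```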

A fully Unicode-aware regex would be

console.log("This is ą-węird string,right?".match(/[\p{L}\p{M}\p{N}]+|[\p{P}\p{S}]/gu))

Here,

  • [\p{L}\p{M}\p{N}]+ - one or more Unicode letters, diacritics or digits
  • | - or
  • [\p{P}\p{S}] - a single punctuation proper or symbol char.

See this regex demo.
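To see why the Unicode-aware version matters, the two patterns can be run side by side on a word with accented letters. The variable names below are illustrative only, and the sample uses the precomposed forms of ą and ę.

```javascript
// Both patterns are taken from the answer; the names are illustrative.
const asciiPattern = /[^\W_]+|[^\w\s]|_/g;
const unicodePattern = /[\p{L}\p{M}\p{N}]+|[\p{P}\p{S}]/gu;

// "\u0105" is ą and "\u0119" is ę (precomposed forms).
const sample = "\u0105-w\u0119ird";

const asciiTokens = sample.match(asciiPattern) || [];
const unicodeTokens = sample.match(unicodePattern) || [];

console.log(asciiTokens);   // accented letters come out as separate tokens
console.log(unicodeTokens); // the accented word stays in one piece
```

With the ASCII pattern, \w does not cover ą or ę, so they fall into the "other char" branch and split the word apart.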

Upvotes: 4

Tim Biegeleisen

Reputation: 520878

Here is a regex splitting approach. We can try splitting on the following pattern:

\s+|(?<=\w)(?=\W)|(?<=\W)(?=\w)

Code snippet:

var input = "This is a-weird string,right?";
var parts = input.split(/\s+|(?<=\w)(?=\W)|(?<=\W)(?=\w)/);
console.log(parts);

Here is an explanation of the regex pattern used, which says to split on:

\s+            whitespace
|              OR
(?<=\w)(?=\W)  the boundary between a preceding word character and a following
               non-word character
|              OR
(?<=\W)(?=\w)  the boundary between a preceding non-word character and a following
               word character
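Note that lookbehind needs an engine with ES2018 regex support. As a sketch of a possible fallback (an assumption, not part of the answer), \b seems to give the same result when used with split(), since it also marks word/non-word transitions:

```javascript
const input = "This is a-weird string,right?";

// Pattern from the answer; requires lookbehind support (ES2018).
const withLookarounds = input.split(/\s+|(?<=\w)(?=\W)|(?<=\W)(?=\w)/);

// Assumed alternative: \b marks the same word/non-word transitions
// without needing lookbehind.
const withWordBoundary = input.split(/\s+|\b/);

console.log(withLookarounds);
console.log(withWordBoundary);

// Neither pattern breaks up a run of punctuation.
const runs = "wait...what?!".split(/\s+|\b/);
console.log(runs);
```

Unlike the match()-based approach, both split patterns keep consecutive punctuation such as "?!" together as one element.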

Upvotes: 2
