Scott Wright

Reputation: 186

Use regex to separate any string into an array of whole words, punctuation & html tags

So far, the only approach I have working is matching on whitespace. I would also like arbitrary HTML tags and punctuation to come out as separate elements.

var text = "<div>The Quick brown fox ran through it's forest darkly!</div>"

//this one uses spaces only but will match "darkly!</div>" as 1 element
console.log(text.match(/\S+/g));

//outputs: ["<div>The", "Quick", "brown", "fox", "ran", "through", "it's", "forest", "darkly!</div>"]

I want a matching expression that will output:

["<div>", "The", "Quick", "brown", "fox", "ran", "through", "it's", "forest", "darkly", "!", "</div>"]

Here is a fiddle: https://jsfiddle.net/scottpatrickwright/og0bd0xj/2/

Ultimately I am going to store all of the matches in an array, do some processing (add span tags with a conditional data attribute around every whole word) and re-output the original string in an altered form. I mention this because solutions that do not leave the string more or less intact would not work.

I am finding lots of near-miss solutions online; however, my regex skills are not good enough to adapt them.

Upvotes: 0

Views: 60

Answers (3)

Pęgaz

Reputation: 46

My suggestion would be:

console.log(text.match(/(<.+?>|[^\s<>]+)/g));

The regex (<.+?>|[^\s<>]+) has two alternatives:

<.+?> matches every <text> string, i.e. a tag
[^\s<>]+ matches every run of characters that contains no whitespace, < or >

In the second one you can add any other characters you want to exclude.
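
As a rough sketch of that idea: the pattern above still returns "darkly!" as one element, so the variant below splits the second alternative into a word part and a single-punctuation part. This is not the regex from the answer; the name `tokenPattern` and the choice to keep apostrophes inside words are assumptions made for illustration.

```javascript
var text = "<div>The Quick brown fox ran through it's forest darkly!</div>";

// Hypothetical variant of the pattern above:
//   <[^>]+>       a whole tag such as <div> or </div>
//   [\w']+        a word, with apostrophes kept inside it
//   [^\s\w<>']    any single remaining punctuation character
var tokenPattern = /<[^>]+>|[\w']+|[^\s\w<>']/g;

var tokens = text.match(tokenPattern);
console.log(tokens);
// ["<div>", "The", "Quick", "brown", "fox", "ran", "through", "it's", "forest", "darkly", "!", "</div>"]
```

Whitespace is still discarded here, so the array alone cannot reproduce the original spacing.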

Upvotes: 0

Eric Leibenguth

Reputation: 4277

How about:

/(<\/?)?[\w']+>?|[!\.,;\?]/g

Demonstrated here.
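
For reference, a small sketch of how this pattern behaves on the string from the question. The second input, a tag with an attribute, is an assumed extra case that does not appear in the question; it is there to show where the pattern stops treating a tag as one element. The variable names are illustrative.

```javascript
var text = "<div>The Quick brown fox ran through it's forest darkly!</div>";

// The pattern from this answer:
//   (<\/?)?[\w']+>?   a word, optionally wrapped as <word> or </word>
//   [!\.,;\?]         one of the listed punctuation marks
var pattern = /(<\/?)?[\w']+>?|[!\.,;\?]/g;

var sampleTokens = text.match(pattern);
console.log(sampleTokens);
// ["<div>", "The", "Quick", "brown", "fox", "ran", "through", "it's", "forest", "darkly", "!", "</div>"]

// Assumed extra input: a tag with an attribute is split into pieces,
// and the characters = " > are not matched at all.
var attrTokens = '<div class="x">Hi</div>'.match(pattern);
console.log(attrTokens);
// ["<div", "class", "x", "Hi", "</div>"]
```

So the pattern fits bare tags like the ones in the question; tags with attributes would need a dedicated tag alternative.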

Upvotes: 2

Jamie Barker

Reputation: 8246

You could just add a space before and after the HTML tags like so:

var text = "<div>The Quick brown fox ran through it's forest darkly!</div>"
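text = text.replace(/<(.*?)>/g, ' <$1> '); // pad every tag with a space on each side
// [\w']+ rather than \w+, so that "it's" is not split into "it" and "'s"
console.log(text.match(/[\w']+|\S+/g)); // ## Credit to George Lee ##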
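
Padding the tags changes the string, so if the original spacing has to survive the round trip, one option is to keep whitespace as tokens too. The sketch below is only an illustration of that: the pattern, the helper `wrapWord` and the `data-word` attribute are all made-up placeholders for whatever conditional attribute is actually needed.

```javascript
var text = "<div>The Quick brown fox ran through it's forest darkly!</div>";

// Hypothetical pattern: tags, words, whitespace runs, then any other single character.
// Every character lands in exactly one token, so tokens.join("") equals the input.
var tokens = text.match(/<[^>]+>|[\w']+|\s+|[\s\S]/g);

// wrapWord is an illustrative helper; data-word is a placeholder attribute name.
function wrapWord(token) {
  return /^[\w']+$/.test(token)
    ? '<span data-word="' + token.toLowerCase() + '">' + token + "</span>"
    : token; // tags, whitespace and punctuation pass through unchanged
}

var output = tokens.map(wrapWord).join("");
console.log(tokens.join("") === text); // true
console.log(output);
```

Because nothing is dropped during tokenizing, removing the added span tags from `output` gives back the original string.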

Upvotes: 0
