Reputation: 26142
I've been trying to achieve this: I want to wrap each word in a <w> tag and each run of whitespace (which may be several characters long) in an <s> tag, assuming the original text can contain HTML tags that should not be touched.
This is <b>very bold</b> word.
convert to -->
<w>This</w><s> </s><w>is</w><s> </s><b><w>very</w><s> </s><w>bold</w></b><s> </s><w>word</w>
What is the right regex to achieve that?
Upvotes: 4
Views: 772
Reputation: 43683
You should use two replacements >>
s.replace(/([^\s<>]+)(?:(?=\s)|$)/g, '<w>$1</w>').replace(/(\s+)/g, '<s>$1</s>')
Check this demo.
EDIT:
For more complex inputs (based on your comment below), go with >>
s.replace(/([^\s<>]+)(?![^<>]*>)(?:(?=[<\s])|$)/g, '<w>$1</w>').replace(/(\s+)(?![^<>]*>)/g, '<s>$1</s>');
Check this demo.
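To compare the two variants side by side, here is a minimal sketch that wraps each one in a function and runs it on the sample string from the question. The function names wrapSimple and wrapTagAware are made up for illustration; the regular expressions are copied unchanged from above.

```javascript
// Variant 1: a word only counts if it is followed by whitespace or the end of input.
function wrapSimple(s) {
  return s
    .replace(/([^\s<>]+)(?:(?=\s)|$)/g, '<w>$1</w>')
    .replace(/(\s+)/g, '<s>$1</s>');
}

// Variant 2: (?![^<>]*>) rejects any match that sits inside a tag,
// and (?=[<\s]) also lets a word end right before a tag.
function wrapTagAware(s) {
  return s
    .replace(/([^\s<>]+)(?![^<>]*>)(?:(?=[<\s])|$)/g, '<w>$1</w>')
    .replace(/(\s+)(?![^<>]*>)/g, '<s>$1</s>');
}

var input = 'This is <b>very bold</b> word.';

console.log(wrapSimple(input));
// <w>This</w><s> </s><w>is</w><s> </s><b><w>very</w><s> </s>bold</b><s> </s><w>word.</w>

console.log(wrapTagAware(input));
// <w>This</w><s> </s><w>is</w><s> </s><b><w>very</w><s> </s><w>bold</w></b><s> </s><w>word.</w>
```

Note that the first variant leaves "bold" unwrapped because it is followed directly by a closing tag, and that both variants keep the trailing period inside the last <w>, which differs slightly from the output shown in the question.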
Upvotes: 1
Reputation: 22508
Regular expressions are not suited for every task. If your string can contain arbitrary HTML, then it's not possible to handle all cases using regular expressions, because HTML is a context-free language and regular expressions cover only a subset of such languages. Now, before messing around with loops and a load of code to handle this, let me suggest the following:
If you are in a browser environment or have access to a DOM library, you could put this string inside a temporary DOM element, then work on the text nodes and then read the string back.
Here's an example using a library I wrote some months ago and have just updated, called Linguigi:
var element = document.createElement('div');
element.innerHTML = 'This is <b>very bold</b> word.';
var ling = new Linguigi(element);
ling.eachWord(true, function(text) {
    return '<w>' + text + '</w>';
});
ling.eachToken(/ +/g, true, function(text) {
    return '<s>' + text + '</s>';
});
alert(element.innerHTML);
Example: http://prinzhorn.github.com/Linguigi/ (hit the Stackoverflow 12758422 button)
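If no DOM is available at all (plain Node.js, say), the idea of only touching the text between tags can be roughly approximated without any library. This is a different technique from the DOM approach above, not a replacement for it: the sketch below splits the string on anything that looks like a tag and only rewrites the chunks in between. The function name wrapBySplitting is made up for illustration, and the sketch assumes reasonably well-formed markup; a ">" inside an attribute value, comments, or script content will confuse it.

```javascript
function wrapBySplitting(html) {
  return html
    // The capture group keeps the tags in the result, so chunks alternate:
    // even indexes are text, odd indexes are tags.
    .split(/(<[^>]*>)/)
    .map(function (chunk, i) {
      if (i % 2 === 1) {
        return chunk; // a tag: leave it untouched
      }
      // Text: wrap each run of whitespace in <s> and each run of non-whitespace in <w>.
      return chunk.replace(/\s+|\S+/g, function (token) {
        return /^\s/.test(token)
          ? '<s>' + token + '</s>'
          : '<w>' + token + '</w>';
      });
    })
    .join('');
}

console.log(wrapBySplitting('This is <b>very bold</b> word.'));
// <w>This</w><s> </s><w>is</w><s> </s><b><w>very</w><s> </s><w>bold</w></b><s> </s><w>word.</w>
```

With a real DOM, walking the text nodes as described above remains the more robust choice, since the parser has already worked out where the tags begin and end.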
Upvotes: 0