Reputation: 7121
I'm trying to write a regex that does the following:
1. `_` -> replace it with a space
2. `+` -> remove it if there is not another `+` after it (i.e. `c++` => `c++`, `c+` -> `c`)
3. `'` -> remove it if it's at the start or end of a word (i.e. `Alin's` -> `Alin's`, `'Alin's` -> `Alin's`)
4. `&`, `-`, `.`, `!` -> don't remove
5. Any other character -> remove it
I want to do it in a single pass over the string.
For example:
Input: "abc's, test_s! & c++ c+ 'Dirty's'. and beautiful'..."
Output: "abc's test s! & c++ c Dirty's. and beautiful..."
Explanation:
char `'` in `abc's,` stays because `3`
char `,` in `abc's,` was removed because `5`
char `_` in `test_s!` was replaced by space because `1`
char `!` in `test_s!` is not removed because `4`
char `&` is not removed because `4`
char `+` in `c++` is not removed because `2`
char `+` in `c+` was removed because `2`
word: `'Dirty's'.` was replaced to `Dirty's.` because `3` and `4`
char `'` in `beautiful'...` was removed because `3`
char `.` is not removed because of `4`
This is my JavaScript code:
var str = "abc's test_s c++ c+ 'Dirty's'. and beautiful";
console.log(str);
str = str.replace(/[_]/g, " ");
str = str.replace(/[^a-zA-Z0-9 &-.!]/g, "");
console.log(str);
This is my jsfiddle: http://jsfiddle.net/alonshmiel/LKjYd/4/
I don't like my code because I'm sure it's possible to do this in a single pass over the string.
Any help appreciated!
Upvotes: 0
Views: 599
Reputation: 3919
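function sanitize(str){
  return str.replace(/(_)|(\'\W|\'$)|(^\'|\W\')|(\+\+)|([a-zA-Z0-9\ \&\-\.\!\'])|(.)/g,function(car,p1,p2,p3,p4,p5,p6){
    if(p1) return " ";                      // underscore -> space
    if(p2) return sanitize(p2.slice(1));    // quote ending a word: drop it, re-check the character after it
    if(p3) return sanitize(p3.slice(0,-1)); // quote starting a word: drop it, re-check the character before it
    if(p4) return p4.slice(0,p4.length-p4.length%2); // "++" is kept
    if(p5) return car;                      // allowed character: keep it
    if(p6) return "";                       // anything else: remove it
  });
}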
document.querySelector('#sanitize').addEventListener('click',function(){
document.querySelector('#output').innerHTML=
sanitize(document.querySelector('#inputString').value);
});
#inputString{
width:290px
}
#sanitize{
background: #009afd;
border: 1px solid #1777b7;
border:none;
color:#fff;
cursor:pointer;
height: 1.55em;
}
#output{
background:#ddd;
margin-top:5px;
width:295px;
}
<input id="inputString" type="text" value="abc's test_s! & c++ c+ 'Dirty's'. and beau)'(tiful'..."/>
<input id="sanitize" type="button" value="Sanitize it!" />
<div id="output" ></div>
some points:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions
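The callback in `sanitize` works because `replace` passes each capture group as a separate argument, and groups that did not take part in the match arrive as `undefined`. Here is a minimal sketch of that mechanism. The simplified pattern and the `classify` helper are hypothetical and not part of the answer above:

```javascript
// Hypothetical helper: records which capture group matched at each step.
function classify(str) {
  var kinds = [];
  str.replace(/(_)|(\+\+)|([a-z])|(.)/g, function (match, underscore, doublePlus, letter, other) {
    // Only one of the four groups is defined for any given match.
    if (underscore !== undefined) kinds.push("underscore");
    else if (doublePlus !== undefined) kinds.push("doublePlus");
    else if (letter !== undefined) kinds.push("letter");
    else if (other !== undefined) kinds.push("other");
    return match; // leave the string unchanged, we only inspect the groups
  });
  return kinds;
}

console.log(classify("a_++!")); // [ 'letter', 'underscore', 'doublePlus', 'other' ]
```

The order of the alternatives matters: the first alternative that matches at a position wins, which is why the catch-all `(.)` has to come last.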
Upvotes: 3
Reputation: 89547
Because the replacement you need can differ (nothing or a space), you can't use a fixed replacement string under the one-pass constraint. The only way is to use a dynamic replacement.
Direct approach:
Let's try to match the characters to remove and, in certain cases, capture the ones to preserve:
var str = "abc's, test_s! & c++ c+ 'Dirty's'. and beautiful'...";
var re = /[^\w\s&.!'+-]+|\B'+|'+\B|(\+{2,})|\+|'*(_)'*/g;
var result = str.replace(re, function (_, g1, g2) {
    if (g1) return g1;
    return (g2) ? ' ' : '';
});
console.log(result);
When an underscore is found, capture group 2 is defined (`g2` in the callback function) and a space is returned.
Note: in the above example, the term "word" is taken in the regex sense (the character class `\w`, so `[a-zA-Z0-9_]`, except for the underscore). If you want to be more rigorous, for example to exclude single quotes next to digits, you need to change the pattern a little:
var re = /[^\w\s&.!'+-]+|(_)'*|([^a-z])'+|'+(?![a-z])|(\+{2,})|\+|^'+/gi;
var result = str.replace(re, function (_, g1, g2, g3) {
    if (g2) return g2;
    if (g3) return g3;
    return (g1) ? ' ' : '';
});
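To see where the two patterns differ, here is a small sketch that wraps each one in a function and runs both on `5'6`, a quote between two digits. The function names `directLoose` and `directStrict` and the input `5'6` are assumptions chosen for illustration:

```javascript
// Pattern 1: "word" in the regex sense (\w, minus the underscore).
function directLoose(str) {
  var re = /[^\w\s&.!'+-]+|\B'+|'+\B|(\+{2,})|\+|'*(_)'*/g;
  return str.replace(re, function (_, g1, g2) {
    if (g1) return g1;        // two or more "+" are kept
    return g2 ? " " : "";     // underscore -> space, everything else removed
  });
}

// Pattern 2: only letters count as word characters.
function directStrict(str) {
  var re = /[^\w\s&.!'+-]+|(_)'*|([^a-z])'+|'+(?![a-z])|(\+{2,})|\+|^'+/gi;
  return str.replace(re, function (_, g1, g2, g3) {
    if (g2) return g2;        // non-letter before the quote(s) is kept
    if (g3) return g3;        // two or more "+" are kept
    return g1 ? " " : "";     // underscore -> space, everything else removed
  });
}

console.log(directLoose("5'6"));  // 5'6
console.log(directStrict("5'6")); // 56
```

On the example string from the question, both functions should return the same result. They only diverge when a quote touches a digit or another non-letter word character.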
Note about the two patterns:
Both patterns are an alternation of 6 or 7 subpatterns, each of which matches about 1 or 2 characters most of the time. Keep in mind that, for every character that must not be replaced, the regex engine has to test all 6 or 7 alternatives before failing at that position. That is a significant cost, since most characters don't need to be replaced.
There is a way to reduce this cost that you can apply here: first-character discrimination.
The idea is to avoid testing each subpattern as much as possible. This works here because no subpattern begins with a letter, so if you add a lookahead at the beginning, you can quickly skip every letter without having to test each subpattern. Example for pattern 2:
var re = /(?=[^a-z])(?:[^\w\s&.!'+-]+|(_)'*|([^a-z])'+|'+(?![a-z])|(\+{2,})|\+|^'+)/gi;
For the first pattern you can skip more characters:
var re = /(?=[^a-z0-9\s&.!-])(?:[^\w\s&.!'+-]+|\B'+|'+\B|(\+{2,})|\+|'*(_)'*)/gi;
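The lookahead only changes how fast the engine gives up at a position, not what is matched. A quick sketch to check this for pattern 1 on the example string (the variable names `plain`, `guarded` and `replacer` are assumptions for illustration):

```javascript
var input = "abc's, test_s! & c++ c+ 'Dirty's'. and beautiful'...";

// Pattern 1 without the leading lookahead.
var plain = /[^\w\s&.!'+-]+|\B'+|'+\B|(\+{2,})|\+|'*(_)'*/g;
// Pattern 1 with the lookahead: positions that start with a letter, digit,
// whitespace, "&", ".", "!" or "-" are rejected before any alternative is tried.
var guarded = /(?=[^a-z0-9\s&.!-])(?:[^\w\s&.!'+-]+|\B'+|'+\B|(\+{2,})|\+|'*(_)'*)/gi;

// Same callback for both: keep "++", turn "_" into a space, drop the rest.
function replacer(_, g1, g2) {
  if (g1) return g1;
  return g2 ? " " : "";
}

var plainResult = input.replace(plain, replacer);
var guardedResult = input.replace(guarded, replacer);

console.log(plainResult);                   // abc's test s! & c++ c Dirty's. and beautiful...
console.log(plainResult === guardedResult); // true
```

This only confirms that the output is the same. The step counts themselves depend on the regex engine, so they are best measured with a regex debugger.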
Despite these improvements, both patterns still need a lot of steps (~400) for such a small string (but consider that it's an example string with all the possible cases in it).
A more indirect approach:
Now let's try another way: match the character to replace, but this time together with all the characters before it.
var re = /((?:[a-z]+(?:'[a-z]+)*|\+{2,}|[\s&.!-]+)*)(?:(_)|.)?/gi
var result = str.replace(re, function (_, g1, g2) {
return g1 + ((g2) ? ' ' : '' );
});
(Note that there is no need to guard against catastrophic backtracking, because `(?:a+|b+|c+)*` is followed by an always-true subpattern `(?:d|e)?`. Besides, the whole pattern never fails, whatever the string or the position in it.)
All characters before the character to replace (the allowed content) are captured and returned by the callback function.
This approach needs fewer than half the steps to do the same job.
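To make the chunking visible, the sketch below runs the indirect pattern on the example string and records what group 1 captured on each call. The `chunks` array is an assumption added for illustration:

```javascript
var input = "abc's, test_s! & c++ c+ 'Dirty's'. and beautiful'...";
var re = /((?:[a-z]+(?:'[a-z]+)*|\+{2,}|[\s&.!-]+)*)(?:(_)|.)?/gi;

var chunks = []; // allowed content captured by group 1 on each match
var result = input.replace(re, function (_, g1, g2) {
  chunks.push(g1);
  // g1 is kept as-is; the single trailing character is either an
  // underscore (replaced by a space) or an unwanted character (dropped).
  return g1 + (g2 ? " " : "");
});

console.log(chunks); // the first chunk is "abc's", captured before the comma
console.log(result); // abc's test s! & c++ c Dirty's. and beautiful...
```

Each match consumes a whole run of allowed content plus at most one character to replace, so the callback is invoked far less often than once per character.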
Upvotes: 2
Reputation: 3299
Try this regex: /(?!\b)'|'(?=\B)|^'|'$|[^\w\d\s&-.!]|\+(?=[^+])/gm
function sanitize(str) {
  var re = /(?!\b)'|'(?=\B)|^'|'$|[^\w\d\s&-.!]|\+(?=[^+])/gm;
  var subst = '';
  var tmp = str.replace(re, subst); // remove everything the regex matches (underscores are left alone)
  var result = tmp.replace(/_/g, " "); // then replace every (_) with a space
  return result;
}
document.querySelector('#sanitize').addEventListener('click', function() {
document.querySelector('#output').innerHTML =
sanitize(document.querySelector('#inputString').value);
});
#inputString {
width: 290px
}
#sanitize {
background: #009afd;
border: 1px solid #1777b7;
border: none;
color: #fff;
cursor: pointer;
height: 1.55em;
}
#output {
background: #eee;
margin-top: 5px;
width: 295px;
}
<input id="inputString" type="text" value="abc's test_s! & c++ c+ 'Dirty's'. and beau)'(tiful'..." />
<input id="sanitize" type="button" value="Sanitize it!" />
<div id="output"></div>
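For the underscore step, it matters whether `replace` is given a string or a global regex. A minimal sketch with a hypothetical input `a_b_c`:

```javascript
var sample = "a_b_c";

// A string pattern replaces only the first occurrence.
var firstOnly = sample.replace("_", " ");
// A regex with the g flag replaces every occurrence.
var all = sample.replace(/_/g, " ");

console.log(firstOnly); // a b_c
console.log(all);       // a b c
```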
Upvotes: 1
Reputation: 59232
What you need is chaining and the alternation operator:
function customReplace(str){
return str.replace(/_/g, " ").replace(/^'|'$|[^a-zA-Z0-9 &-.!]|\+(?=[^+])/g,"");
}
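The regex `/^'|'$|[^a-zA-Z0-9 &-.!]|\+(?=[^+])/g` combines everything that needs to be removed. Before that, every `_` is replaced by a space, and the final string is returned.
`\+(?=[^+])` looks for a `+` that is followed by anything except `+`.
Also, the order of the two replacements is important: `_` is not in the allowed character class, so if the removal ran first, the underscores would be deleted before they could be turned into spaces.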
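One detail worth checking in a class like `[^a-zA-Z0-9 &-.!]`: a `-` between two characters is read as a range, so `&-.` covers everything from `&` to `.`, including `,` and `+`. The sketch below shows the difference, followed by one possible stricter variant. The name `customReplaceStrict` is hypothetical, the variant assumes an engine with lookbehind support (ES2018 or later), and it is not the code from the answer above:

```javascript
// Inside [...], "&-." is a range from "&" (U+0026) to "." (U+002E),
// so it also covers ' ( ) * + , and -
var rangeClass = /[&-.]/;
var literalClass = /[&.-]/; // "-" placed last is a literal hyphen

console.log(rangeClass.test(","));   // true
console.log(literalClass.test(",")); // false

// Hypothetical stricter variant, still two chained replacements:
//  - the class lists "-" last, so only the named characters are allowed
//  - a "+" is removed only when it has no "+" neighbour on either side
//  - a quote is removed when it is not between two letters or digits
function customReplaceStrict(str) {
  return str
    .replace(/_/g, " ")
    .replace(/[^a-zA-Z0-9 &.!'+-]|(?<!\+)\+(?!\+)|(?<![a-zA-Z0-9])'|'(?![a-zA-Z0-9])/g, "");
}

console.log(customReplaceStrict("abc's, test_s! & c++ c+ 'Dirty's'. and beautiful'..."));
// abc's test s! & c++ c Dirty's. and beautiful...
```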
Upvotes: 1