Alon Shmiel

Reputation: 7121

Remove and replace characters by regex

I'm trying to write a regex that does the following:

  1. _ -> replace it with a space
  2. + -> remove it if there is not another + after it (e.g. c++ -> c++, c+ -> c)
  3. ' -> remove it if it's at the start or end of a word (e.g. Alin's -> Alin's, 'Alin's -> Alin's)
  4. &, -, ., ! -> don't remove
  5. Any other special character -> remove

I want to do it in a single pass over the string.

For example:

Input: "abc's, test_s! & c++ c+ 'Dirty's'. and beautiful'..."
Output: "abc's test s! & c++ c Dirty's. and beautiful..."

Explanation:

char `'` in `abc's,` stays because of `3`
char `,` in `abc's,` was removed because of `5`
char `_` in `test_s!` was replaced by a space because of `1`
char `!` in `test_s!` is not removed because of `4`
char `&` is not removed because of `4`
char `+` in `c++` is not removed because of `2`
char `+` in `c+` was removed because of `2`
word `'Dirty's'.` was replaced by `Dirty's.` because of `3` and `4`
char `'` in `beautiful'...` was removed because of `3`
char `.` is not removed because of `4`

This is my JavaScript code:

var str = "abc's test_s c++ c+ 'Dirty's'. and beautiful";
console.log(str);
str = str.replace(/[_]/g, " ");
str = str.replace(/[^a-zA-Z0-9 &-.!]/g, "");
console.log(str);

This is my jsfiddle: http://jsfiddle.net/alonshmiel/LKjYd/4/

I don't like my code because I'm sure it's possible to do this in a single pass over the string.

Any help appreciated!

Upvotes: 0

Views: 599

Answers (4)

Gaël Barbin

Reputation: 3919

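function sanitize(str){
  return str.replace(/(_)|(\'\W|\'$)|(^\'|\W\')|(\+\+)|([a-zA-Z0-9\ \&\-\.\!\'])|(.)/g,function(car,p1,p2,p3,p4,p5,p6){
    // p1: underscore, replaced by a space
    if(p1) return " ";
    // p2: quote followed by a non-word character (or at the end): drop the quote, sanitize the rest
    if(p2) return sanitize(p2.slice(1));
    // p3: quote preceded by a non-word character (or at the start): drop the quote, sanitize the rest
    if(p3) return sanitize(p3.slice(0,-1));
    // p4: a pair of +, kept (its length is always 2, so the slice returns it unchanged)
    if(p4) return p4.slice(0,p4.length-p4.length%2);
    // p5: allowed character, kept as is
    if(p5) return car;
    // p6: any other character, removed
    if(p6) return "";
  });
}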
document.querySelector('#sanitize').addEventListener('click',function(){
  
  document.querySelector('#output').innerHTML=      
	  sanitize(document.querySelector('#inputString').value);
});
#inputString{
  width:290px
}
#sanitize{
  background: #009afd;
  border: 1px solid #1777b7;
  border:none;
  color:#fff;
  cursor:pointer;
  height: 1.55em;
}

#output{
  background:#ddd;
  margin-top:5px;
  width:295px;
}
<input id="inputString" type="text" value="abc's test_s! & c++ c+ 'Dirty's'. and beau)'(tiful'..."/>
<input id="sanitize" type="button" value="Sanitize it!" />
<div id="output" ></div>

Some points:

  • The one-pass constraint is not fully respected, because the character captured with \W has to be sanitized as well. I did not find any other way.
  • About the ++ rule: any sequence of + is shortened by one + if its length is odd.
  • Apostrophes are only removed if there is a non-alphanumeric character next to them. What do you want to do with, for example, "abc'&": "abc&" or "abc'&"? And what about "ab_'s"?
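
The ++ rule from the second point can be looked at in isolation. The sketch below is not part of the answer's code: reducePlusRuns is a hypothetical helper that keeps only the plus-sign branches of the alternation, to show that pairs of + survive and a leftover single + is dropped.

```javascript
// Hypothetical helper: isolates the "+" handling from the full pattern.
// A pair of "+" is captured and kept; a lone "+" left over is removed.
function reducePlusRuns(str) {
  return str.replace(/(\+\+)|\+/g, function (match, pair) {
    return pair || "";
  });
}

console.log(reducePlusRuns("c+"));    // c
console.log(reducePlusRuns("c++"));   // c++
console.log(reducePlusRuns("c+++"));  // c++
console.log(reducePlusRuns("c++++")); // c++++
```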

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace#Specifying_a_function_as_a_parameter

Upvotes: 3

Casimir et Hippolyte

Reputation: 89547

Because the replacement you need can be different (nothing or a space), you can't use a fixed string (due to the one-pass constraint). So the only way is to use a dynamic replacement.

Direct approach:

Let's try to find the characters to remove and, in certain cases, to preserve others:

var str = "abc's, test_s! & c++ c+ 'Dirty's'. and beautiful'...";

var re = /[^\w\s&.!'+-]+|\B'+|'+\B|(\+{2,})|\+|'*(_)'*/g; 

var result = str.replace(re, function (_, g1, g2) {
    if (g1) return g1;        // run of two or more "+": keep it
    return g2 ? ' ' : '';     // underscore -> space, anything else -> removed
});

console.log(result);

When an underscore is found, capture group 2 is defined (g2 in the callback function) and a space is returned.

Note: in the above example the term "word" is taken in the regex sense (the character class \w, i.e. [a-zA-Z0-9_], minus the underscore). If you want to be more rigorous, for example to exclude single quotes next to digits, you need to change the pattern a little:

var re = /[^\w\s&.!'+-]+|(_)'*|([^a-z])'+|'+(?![a-z])|(\+{2,})|\+|^'+/gi;

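var result = str.replace(re, function (_, g1, g2, g3) {
    if (g2) return g2;        // character before the removed quote(s): keep it
    if (g3) return g3;        // run of two or more "+": keep it
    return g1 ? ' ' : '';     // underscore -> space, anything else -> removed
});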
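
To make the difference between the two patterns concrete, here is a small sketch that wraps each one in a function and runs both on a string with a quote between digits. The function names sanitizeLoose and sanitizeStrict and the test string are only illustrative; the patterns and callbacks are the ones shown above.

```javascript
// Pattern 1: "word" means \w, so a quote between two digits is kept.
function sanitizeLoose(str) {
  var re = /[^\w\s&.!'+-]+|\B'+|'+\B|(\+{2,})|\+|'*(_)'*/g;
  return str.replace(re, function (_, g1, g2) {
    if (g1) return g1;
    return g2 ? ' ' : '';
  });
}

// Pattern 2: a quote is kept only when it sits between two letters.
function sanitizeStrict(str) {
  var re = /[^\w\s&.!'+-]+|(_)'*|([^a-z])'+|'+(?![a-z])|(\+{2,})|\+|^'+/gi;
  return str.replace(re, function (_, g1, g2, g3) {
    if (g2) return g2;
    if (g3) return g3;
    return g1 ? ' ' : '';
  });
}

console.log(sanitizeLoose("5'10 it's"));  // 5'10 it's
console.log(sanitizeStrict("5'10 it's")); // 510 it's
```

Both functions give the same result on the example string from the question; they only differ on input such as quotes next to digits.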

Note about the two patterns:

These two patterns consist of an alternation of 6 or 7 subpatterns, each of which matches about 1 or 2 characters most of the time. Keep in mind that, for each character that must not be replaced, these patterns have to test all 6 or 7 alternatives before failing. That is a significant cost, and most of the time a character doesn't need to be replaced.

There is a way to reduce this cost that you can apply here: first-character discrimination.

The idea is to avoid testing each subpattern as much as possible. This can be done here because none of the subpatterns begins with a letter, so you can quickly skip every letter without having to test each subpattern if you add a lookahead at the beginning. Example for pattern 2:

var re = /(?=[^a-z])(?:[^\w\s&.!'+-]+|(_)'*|([^a-z])'+|'+(?![a-z])|(\+{2,})|\+|^'+)/gi;

For the first pattern you can skip more characters:

var re = /(?=[^a-z0-9\s&.!-])(?:[^\w\s&.!'+-]+|\B'+|'+\B|(\+{2,})|\+|'*(_)'*)/gi;

Despite these improvements, these two patterns need a lot of steps for a small string (~400), but keep in mind that it's an example string with all the possible cases in it.
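
The lookahead is only meant to skip positions faster, not to change what is matched. A quick way to check this, sketched here for pattern 1 with illustrative variable names (plain, guarded, replacer), is to run both variants on the same string and compare the results:

```javascript
var sample = "abc's, test_s! & c++ c+ 'Dirty's'. and beautiful'...";

// Same callback for both variants of pattern 1.
function replacer(_, g1, g2) {
  if (g1) return g1;
  return g2 ? ' ' : '';
}

var plain = /[^\w\s&.!'+-]+|\B'+|'+\B|(\+{2,})|\+|'*(_)'*/g;
var guarded = /(?=[^a-z0-9\s&.!-])(?:[^\w\s&.!'+-]+|\B'+|'+\B|(\+{2,})|\+|'*(_)'*)/gi;

var a = sample.replace(plain, replacer);
var b = sample.replace(guarded, replacer);

console.log(a === b); // true
console.log(b);       // abc's test s! & c++ c Dirty's. and beautiful...
```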

A more indirect approach:

Now let's try another way, which consists of finding a character to replace, but this time together with all the characters before it.

var re = /((?:[a-z]+(?:'[a-z]+)*|\+{2,}|[\s&.!-]+)*)(?:(_)|.)?/gi

var result = str.replace(re, function (_, g1, g2) {
    return g1 + ((g2) ? ' ' : '' );
});

(Note that there is no need to guard against catastrophic backtracking, because (?:a+|b+|c+)* is followed by an always-true subpattern (?:d|e)?. Besides, the whole pattern never fails, whatever the string or the position in it.)

All characters before the character to replace (the allowed content) are captured and returned by the callback function.

This way needs fewer than half the steps to do the same job.
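
To see how this pattern walks through the string, the following sketch records what capture group 1 holds on each match. The kept array is only there for illustration; the pattern and the callback logic are the ones above.

```javascript
var sample = "abc's, test_s! & c++ c+ 'Dirty's'. and beautiful'...";
var re = /((?:[a-z]+(?:'[a-z]+)*|\+{2,}|[\s&.!-]+)*)(?:(_)|.)?/gi;

var kept = []; // non-empty chunks of allowed content captured by group 1, in order
var result = sample.replace(re, function (_, g1, g2) {
  if (g1) kept.push(g1);
  return g1 + (g2 ? ' ' : '');
});

console.log(kept);
// [ "abc's", " test", "s! & c++ c", " ", "Dirty's", ". and beautiful", "..." ]
console.log(result);
// abc's test s! & c++ c Dirty's. and beautiful...
```

Each match consumes one chunk of allowed content plus, at most, one character to drop or replace. As written, the pattern only treats letters as word characters, so a digit falls through to the `.` branch and is removed; add 0-9 to the letter classes if digits should be kept.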

Upvotes: 2

Ahosan Karim Asik

Reputation: 3299

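Try this regex: /(?!\b)'|'(?=\B)|^'|'$|[^\w\s&\-.!'+]|\+(?=[^+])/gm

function sanitize(str) {
  // The hyphen is escaped so it is a literal "-" and not the range from "&" to ".";
  // "'" and "+" are listed explicitly because the other alternatives handle them.
  var re = /(?!\b)'|'(?=\B)|^'|'$|[^\w\s&\-.!'+]|\+(?=[^+])/gm;
  var subst = '';
  var tmp = str.replace(re, subst);     // remove everything the rules reject, except underscores
  var result = tmp.replace(/_/g, " ");  // then replace every underscore with a space
  return result;
}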

document.querySelector('#sanitize').addEventListener('click', function() {

  document.querySelector('#output').innerHTML =
    sanitize(document.querySelector('#inputString').value);
});
#inputString {
  width: 290px
}
#sanitize {
  background: #009afd;
  border: 1px solid #1777b7;
  border: none;
  color: #fff;
  cursor: pointer;
  height: 1.55em;
}
#output {
  background: #eee;
  margin-top: 5px;
  width: 295px;
}
<input id="inputString" type="text" value="abc's test_s! & c++ c+ 'Dirty's'. and beau)'(tiful'..." />
<input id="sanitize" type="button" value="Sanitize it!" />
<div id="output"></div>
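
One detail worth keeping in mind for the underscore step: String.prototype.replace with a string pattern replaces only the first occurrence, while a regex with the g flag replaces all of them. A minimal illustration, with a made-up input string:

```javascript
var s = "a_b_c";

// A string pattern replaces only the first occurrence.
console.log(s.replace("_", " "));  // a b_c

// A regex with the g flag replaces every occurrence.
console.log(s.replace(/_/g, " ")); // a b c
```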

Upvotes: 1

Amit Joki

Reputation: 59232

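What you need is chaining and the alternation operator:

function customReplace(str){
   return str.replace(/_/g, " ").replace(/^'|'$|[^a-zA-Z0-9 &\-.!'+]|\+(?=[^+])/g,"");
}

The regex /^'|'$|[^a-zA-Z0-9 &\-.!'+]|\+(?=[^+])/g combines the patterns for what has to be removed. Before that, every _ is replaced by a space, and the final string is returned. The hyphen in the character class is escaped so that it is a literal - and not the range from & to ., and ' and + are listed explicitly so that the other alternatives decide when they are removed.

\+(?=[^+]) looks for a + that is followed by any character except +.

Also, the ordering of the replace calls is important: _ is not in the allowed character class, so the removal step would delete every underscore if it ran first.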
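
Why the order matters can be shown with a short sketch. The removal class below is a simplified stand-in (the name removeOthers is only illustrative), chosen so that _ is not among the allowed characters:

```javascript
// Simplified removal class for illustration: "_" is not in the allowed set.
var removeOthers = /[^a-zA-Z0-9 &\-.!'+]/g;

var input = "test_s";

// Underscores first: they become spaces, which the second step keeps.
var spaceFirst = input.replace(/_/g, " ").replace(removeOthers, "");
console.log(spaceFirst);  // test s

// Removal first: the underscore is deleted before it can become a space.
var removeFirst = input.replace(removeOthers, "").replace(/_/g, " ");
console.log(removeFirst); // tests
```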

Upvotes: 1
