mohsenmadi
mohsenmadi

Reputation: 2377

replace/replaceAll with regex on unicode issues

Is there a way to apply the replace method on Unicode text in general (Arabic is of concern here)? In the example below, whereas replacing the entire word works nicely on the English text, it fails to detect and as a result, replace the Arabic word. I added the u as a flag to enable unicode parsing but that didn't help. In the Arabic example below, the word النجوم should be replaced, but not والنجوم, but this doesn't happen.

<!DOCTYPE html>
<html>
<body>
<p>Click to replace...</p>
<button onclick="myFunction()">replace</button>
<p id="demo"></p>
<script>
function myFunction() {
  var str = "الشمس والقمر والنجوم، ثم النجوم والنهار";
  var rep = 'النجوم';
  var repWith = 'الليل';

  //var str = "the sun and the stars, then the starsz and the day";
  //var rep = 'stars';
  //var repWith = 'night';

  var result = str.replace(new RegExp("\\b"+rep+"\\b", "ug"), repWith);
  document.getElementById("demo").innerHTML = result;
}
</script>
</body>
</html>

And, whatever solution you could offer, please keep it with the use of variables as you see in the code above (the variable rep above), as these replace words being sought are passed in through function calls.

UPDATE: To try the above code, replace code in here with the code above.

Upvotes: 5

Views: 430

Answers (2)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 626758

A \bword\b pattern can be represented as (^|[A-Za-z0-9_])word(?![A-Za-z0-9_]) pattern and when you need to replace the match, you need to add $1 before the replacement pattern.

Since you need to work with Unicode, it makes sense to utilize XRegExp library that supports a "shorthand" \pL notation for any base Unicode letter. You may replace A-Za-z in the above pattern with this \pL:

var str = "الشمس والقمر والنجوم، ثم النجوم والنهار";
var rep = 'النجوم';
var repWith = 'الليل';

var regex = new XRegExp('(^|[^\\pL0-9_])' + rep + '(?![\\pL0-9_])');
var result = XRegExp.replace(str, regex, '$1' + repWith, 'all');
console.log(result);
<script src="https://cdnjs.cloudflare.com/ajax/libs/xregexp/3.1.1/xregexp-all.min.js"></script>

UPDATE by @mohsenmadi: To integrate in an Angular app, follow these steps:

  1. Issue an npm install xregexp to add the library to package.json
  2. Inside a component, add an import { replace, build } from 'xregexp/xregexp-all.js';
  3. Build the regex with: let regex = build('(^|[^\\pL0-9_])' + rep + '(?![\\pL0-9_])');
  4. Replace with: let result = replace(str, regex, '$1' + repWith, 'all');

Upvotes: 3

user557597
user557597

Reputation:

Incase you change your mind about whitespace boundary's, here is the regex.

var Rx = new RegExp(
   "(^|[\\u0009-\\u000D\\u0020\\u0085\\u00A0\\u1680\\u2000-\\u200A\\u2028-\\u2029\\u202F\\u205F\\u3000])"
   + text +
   "(?![^\\u0009-\\u000D\\u0020\\u0085\\u00A0\\u1680\\u2000-\\u200A\\u2028-\\u2029\\u202F\\u205F\\u3000])"
   ,"ug");

var result = str.replace( Rx, '$1' + repWith );

Regex explanation

 (                             # (1 start), simulated whitespace boundary
      ^                             # BOL
   |                              # or whitespace
      [\u0009-\u000D\u0020\u0085\u00A0\u1680\u2000-\u200A\u2028-\u2029\u202F\u205F\u3000] 
 )                             # (1 end)

 text                          # To find

 (?!                           # Whitespace boundary
      [^\u0009-\u000D\u0020\u0085\u00A0\u1680\u2000-\u200A\u2028-\u2029\u202F\u205F\u3000] 
 )

In an engine that can use lookbehind assertions, a whitespace boundary
is typically done like this (?<!\S)text(?!\S).

Upvotes: 2

Related Questions