sametbilgi

Reputation: 602

Bad-words filter with special characters

I am using https://www.npmjs.com/package/bad-words and I created a regex to filter special characters.

const Filter = require('bad-words');
const badWordsFilter = new Filter({replaceRegex:  /[A-Za-z0-9öÖÇçŞşĞğİıÜü_]/g});
badWordsFilter.addWords(['badword', 'şğ'])

If the word doesn't contain a Turkish character, it works. But if I write Turkish characters like ş or ğ, the word is not filtered.

Is my regex wrong?

I found this code in documentation:

var filter = new Filter({ regex: /\*|\.|$/gi });
var filter = new Filter({ replaceRegex:  /[A-Za-z0-9가-힣_]/g }); 
//multilingual support for word filtering

Upvotes: 4

Views: 2015

Answers (4)

antoni

Reputation: 5556

You most likely have an encoding problem, since your regex works outside your app; see here: https://regex101.com/r/VpItfH/3/.

So I think encoding the characters of your regex in your app may help:

See the encoded regex result here: https://regex101.com/r/VpItfH/4/


More details

Trying the following encoded regex in a PCRE regex engine will work (https://regex101.com/r/VpItfH/5):

/[A-Za-z0-9\x{f6}\x{d6}\x{c7}\x{e7}\x{15e}\x{15f}\x{11e}\x{11f}\x{130}\x{131}\x{dc}\x{fc}_]/g

But when you select the JavaScript regex engine, the braces {} break the escape, so you need to remove them: a two-digit code such as \x{f6} becomes \xf6, and for a three-digit code you replace \x with \u0, e.g. \x{15e} becomes \u015e.

Then you can do the same match as when you use /[A-Za-z0-9öÖÇçŞşĞğİıÜü_]/g.

Note: to get the hexadecimal code of a character, you can run "Ğ".charCodeAt(0).toString(16) (which gives 11e) and prefix the result with \x for a two-digit code or \u0 for a three-digit code.
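To make the note above concrete, here is a minimal sketch that builds the escaped character class from the literal Turkish letters and checks that it matches the same text as the literal one. The helper name toUnicodeEscapes and the sample string are made up for illustration; only charCodeAt, toString(16), padStart and RegExp are standard JavaScript.

```javascript
// Hypothetical helper: turn every character into a \uXXXX escape
// (4 hex digits, zero-padded), which JavaScript regexes accept.
function toUnicodeEscapes(chars) {
  return Array.from(chars)
    .map((ch) => '\\u' + ch.charCodeAt(0).toString(16).padStart(4, '0'))
    .join('');
}

const turkishLetters = 'öÖÇçŞşĞğİıÜü';
const escaped = toUnicodeEscapes(turkishLetters);

// Same character class, once with literal letters and once with escapes.
const literalRegex = /[A-Za-z0-9öÖÇçŞşĞğİıÜü_]/g;
const escapedRegex = new RegExp('[A-Za-z0-9' + escaped + '_]', 'g');

const sample = 'şğ badword İ!';
const literalResult = sample.match(literalRegex).join('');
const escapedResult = sample.match(escapedRegex).join('');

console.log(escaped);
console.log(literalResult === escapedResult); // true
```

If the two results differ in your app but not here, that points at how the source file is being read (its encoding) rather than at the regex itself.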

Hope this helps, or at least shows that you can encode characters inside a regex and still match the same text. :)

Upvotes: 2

Just a student

Reputation: 11050

You need to make that regular expression Unicode-aware by adding the u flag to it. More precisely, change /[A-Za-z0-9öÖÇçŞşĞğİıÜü_]/g into /[A-Za-z0-9öÖÇçŞşĞğİıÜü_]/gu (note the added u at the end). This will work only in modern browsers (basically, all but Internet Explorer). There are other options as well that you may want to consider if you need to support older browsers.
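As a rough sketch of what the u flag looks like in practice: the first regex is the one from the question with u added. The second one, using the Unicode property escapes \p{L} and \p{N}, is not from this answer; it is an assumption about what an alternative could look like in engines that support ES2018. The sample string is made up.

```javascript
// The question's character class with the u (unicode) flag added.
const withUnicodeFlag = /[A-Za-z0-9öÖÇçŞşĞğİıÜü_]/gu;

// Assumed alternative (not from the answer): any letter or digit in any
// script, via Unicode property escapes. These require the u flag.
const viaPropertyEscapes = /[\p{L}\p{N}_]/gu;

const sampleText = 'şğ!';
const flagMatches = sampleText.match(withUnicodeFlag);
const propertyMatches = sampleText.match(viaPropertyEscapes);

console.log(flagMatches);     // [ 'ş', 'ğ' ]
console.log(propertyMatches); // [ 'ş', 'ğ' ]
```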

Upvotes: 1

Jordan Enev

Reputation: 18684

Can you please try with:

var filter = new Filter({ replaceRegex: /(\w+)/gi });

You definitely have to use the replaceRegex option.


The pattern matches every run of word characters, case-insensitively.

Here's what /(\w+)/gi does descriptively (thanks to regex101):

  1. 1st Capturing Group (\w+).
    1. \w+ matches any word character (equal to [a-zA-Z0-9_])
    2. + Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
  2. Global pattern flags
    1. i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])
    2. g modifier: global. All matches (don't return after first match)
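One thing worth checking before relying on this pattern: as the breakdown above says, \w is equal to [a-zA-Z0-9_], so it does not cover Turkish letters such as ş or ğ. A small sketch with made-up sample strings, using plain JavaScript and no library:

```javascript
const wordPattern = /(\w+)/gi;

// ASCII-only word: matched as a whole.
const asciiMatches = 'badword'.match(wordPattern);

// Only Turkish letters outside [a-zA-Z0-9_]: nothing matches.
const turkishMatches = 'şğ'.match(wordPattern);

// Mixed word: only the ASCII pieces between the Turkish letters match.
const mixedMatches = 'kötü'.match(wordPattern);

console.log(asciiMatches);   // [ 'badword' ]
console.log(turkishMatches); // null
console.log(mixedMatches);   // [ 'k', 't' ]
```

So this replaceRegex would leave the Turkish letters of a word unreplaced, which is worth keeping in mind for the word 'şğ' from the question.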

Upvotes: 1

Andrew Li

Reputation: 1055

Save your JavaScript file as UTF-8 and update your meta tag to:

<meta http-equiv="content-type" content="text/html;charset=utf-8" />

Hope this helps.

Upvotes: 0
