Reputation: 108
I'm trying to create validation rules for a username in two steps:
I thought it would be easy, but I was wrong...
str.replace(/[A-Za-z0-9\s]/g, '')
With such a rule, from "Xxx z 88A ююю 4$??!!" I get "ююю$??!!". But how do I also remove all symbols (so that only "ююю" stays)?
Summary: My main problem is detecting non-Latin characters and separating them from special symbols.
UPDATE: Ok, for my second case I can use:
str.replace(/[\u0250-\ue007]/g, '').replace(/[A-Za-z0-9-_`\s]/g, '')
It works, but it looks dirty... Sorry about the backticks.
Upvotes: 1
Views: 1220
Reputation: 13376
The two cases could be solved as follows ...
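The first case boils down to "allow just non-Latin (non-ASCII) letters", which could be achieved by first removing any sequence of non-letter characters ...
/[^\p{L}]+/gu
... and then removing any sequence of ASCII letters:
/[a-zA-Z]+/g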
The second case allows "just any letter, number and whitespace as well as underscore and dash", which is best achieved by removing any character sequence which contains neither letter/\p{L} nor number/\p{N} nor whitespace/\p{Z} nor underscore nor dash:
/[^\p{L}\p{N}\p{Z}_-]+/gu
In addition, the OP could read about regex Unicode escapes.
const testSample = 'Xxx z_88A-ююю 4$??!!';

console.log(
  '1st case ... allow just non ascii letters ...', {
    testSample,
    result: testSample
      // remove any non letter character sequence ...
      .replace(/[^\p{L}]+/gu, '')
      // ... then remove any ascii letter sequence.
      .replace(/[a-zA-Z]+/g, ''),
  },
);
console.log(
  '2nd case ... allow any letter, number and whitespace as well as underscore and dash ...', {
    testSample,
    result: testSample
      // remove any character sequence which contains neither letter/`\p{L}`
      // nor number/`\p{N}` nor whitespace/`\p{Z}` nor underscore nor dash.
      .replace(/[^\p{L}\p{N}\p{Z}_-]+/gu, ''),
  },
);
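Since the question is about validating a username rather than cleaning it, the same character class can also be turned into a yes/no check. This is only a minimal sketch: the helper name `isValidUsername` is hypothetical, and it assumes an empty string should be rejected.

```javascript
// Hypothetical helper (not part of the answer above): accepts a name only
// if every character is a letter, number, separator, underscore or dash.
// No `g` flag is used, so `test()` keeps no state between calls.
const allowedOnly = /^[\p{L}\p{N}\p{Z}_-]+$/u;

function isValidUsername(name) {
  return allowedOnly.test(name);
}

console.log(isValidUsername('Xxx z_88A-ююю 4'));      // true
console.log(isValidUsername('Xxx z_88A-ююю 4$??!!')); // false
```

Note that `\p{Z}` covers Unicode separator characters such as the plain space. Tabs and line breaks are control characters, so they are not matched by it.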
Upvotes: 2
Reputation: 5075
So instead of matching the "forbidden" characters by specifying them individually or as a range, you could simply invert the match of the allowed characters:
For case one this would be (as I understood it)
[^A-Za-z0-9,.%$^#@$_-]
That little ^ as the first character of the character class (inside the []) inverts the rest of the character class, meaning: match anything except those characters.
Just make sure to keep the - as the last character inside the character class when you want to match (or not match) a literal dash rather than define a range.
And for case two you could similarly specify only the allowed characters. Unfortunately, I did not really understand what you meant by "whitelist" and where you want to remove or keep what.
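As a rough illustration of the inverted class, here is a sketch that applies it to the sample string from the question. The variable names are made up for this example, and the allowed list is simply the one shown above.

```javascript
const input = 'Xxx z 88A ююю 4$??!!';

// Matches every character that is NOT in the allowed list.
// The dash comes last in the class, so it is a literal dash, not a range.
const notAllowed = /[^A-Za-z0-9,.%$^#@_-]/g;

// Strip everything outside the allowed list.
const cleaned = input.replace(notAllowed, '');
console.log(cleaned); // "Xxxz88A4$"

// For validation, check whether any disallowed character is present.
// (No `g` flag here, so `test()` keeps no state between calls.)
const hasDisallowed = /[^A-Za-z0-9,.%$^#@_-]/.test(input);
console.log(hasDisallowed); // true
```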
Upvotes: 1
Reputation: 11592
For the first problem, eliminating a-z, 0-9, whitespace, symbols and punctuation, you need to know some Unicode tricks.
You can reference Unicode sets using the \p option. Symbols are S, punctuation is P.
To use this magic, you need to add the u modifier to the regex.
That gives us:
/([a-z0-9]|\s|\p{S}|\p{P})/giu
(I added the i because then I don't have to write A-Z as well as a-z.)
Since you have a solution for your second problem, I'll leave that with you.
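For reference, a short sketch of that regex applied to the sample string from the question (the variable names here are just for illustration):

```javascript
const text = 'Xxx z 88A ююю 4$??!!';

// ASCII letters and digits (case-insensitive via `i`), whitespace,
// Unicode symbols (\p{S}) and Unicode punctuation (\p{P}).
const unwanted = /([a-z0-9]|\s|\p{S}|\p{P})/giu;

// Remove every match, leaving only the non-Latin letters.
const remaining = text.replace(unwanted, '');
console.log(remaining); // "ююю"
```

Here `$` is removed as a symbol, while `?` and `!` are removed as punctuation.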
Upvotes: 2