Reputation: 63
I'm trying to filter out anything except alphanumeric characters, Russian letters, line breaks, spaces, commas, dots, question marks, exclamation marks, slashes, #, @, colons and parentheses.
My regex so far: r"[^А-я\w\d\n ,.?!ё/@#:()]"
However, it does not match the following string: "๐พ๐๐๐๐๐".
Why not, and how can I make it do so?
Edit: Forgot to mention that it works as expected at https://regexr.com/
Upvotes: 1
Views: 135
Reputation: 83
You can make it match only the characters you need, instead of the characters you don't need.
This should work: [А-я\w\d\"+\"\n\"+\" ,.?!ё/@#:()]
Upvotes: 0
Reputation: 627537
You may check the string at this link and you will see that the "๐พ๐๐๐๐๐" string consists of characters belonging to the \p{L} (letter) category. Your regex starts with [^А-я\w\d, which means it matches any chars but Russian letters (the А-я range excludes ё, which you add a bit later, and Ё, which you never add) and any Unicode letters and digits. The latter is because in Python 3, \w matches any Unicode alphanumeric char and the underscore by default, so the letters in your string are treated as allowed and the negated class never matches them.
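The behaviour described above can be checked with a short sketch. The string from the question did not survive extraction, so the sample below is a stand-in (an assumption): five Mathematical Alphanumeric Symbols spelling "Hello" in bold. The variable names are illustrative only.

```python
import re
import unicodedata

# Stand-in sample (assumption, not the asker's string):
# MATHEMATICAL BOLD letters H, e, l, l, o (U+1D407, U+1D41E, U+1D425, U+1D425, U+1D428)
sample = "\U0001D407\U0001D41E\U0001D425\U0001D425\U0001D428"

# Each character has a general category starting with "L" (letter)
categories = [unicodedata.category(ch) for ch in sample]

# For str patterns, \w is Unicode-aware by default, so it matches every character
word_matches = re.findall(r"\w", sample)

# With re.ASCII, \w is limited to [a-zA-Z0-9_] and matches none of them
ascii_matches = re.findall(r"\w", sample, re.ASCII)

# The pattern from the question therefore finds nothing to remove
original_pattern = re.compile(r"[^А-я\w\d\n ,.?!ё/@#:()]")
leftover = original_pattern.findall(sample)

print(categories, len(word_matches), ascii_matches, leftover)
```

Since \w already covers digits, the \d in the original class is redundant as well.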
It appears you only want to keep Russian and English letters (plus digits and the listed punctuation), so use the corresponding explicit ranges:
r"[^А-ЯЁа-яёA-Za-z0-9\n ,.?!/@#:()]+"
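As a minimal sketch of how such a pattern might be applied, the snippet below removes every disallowed run with re.sub. The helper name strip_disallowed and the sample text are hypothetical; the styled letters are a stand-in for the string in the question, which is not reproduced here.

```python
import re

# Negated class with explicit ranges: anything NOT listed here is removed
ALLOWED_ONLY = re.compile(r"[^А-ЯЁа-яёA-Za-z0-9\n ,.?!/@#:()]+")


def strip_disallowed(text: str) -> str:
    """Return text with every run of disallowed characters deleted."""
    return ALLOWED_ONLY.sub("", text)


# Hypothetical input: Russian text, mathematical bold "Hello", plain ASCII
sample = "Привет, \U0001D407\U0001D41E\U0001D425\U0001D425\U0001D428 world! #1"
cleaned = strip_disallowed(sample)
print(repr(cleaned))
```

Only the styled letters are dropped here; the spaces on either side of them are kept, so two consecutive spaces remain in the result.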
The pattern matches one or more chars other than
А-ЯЁа-яё - Russian letters
A-Za-z - ASCII letters
0-9 - ASCII digits
\n ,.?!/@#:() - newline, space, comma, dot, question and exclamation marks, slash, at sign, hash, colon and round parentheses.
Upvotes: 1