maximxls

Reputation: 63

Why does my regex match non-ASCII characters?

I'm trying to filter anything except alphanumeric characters, Russian letters, line breaks, spaces, commas, dots, question marks, exclamation marks, slashes, #, @, colons and parentheses.

My regex so far: r"[^А-я\w\d\n ,.?!ё/@#:()]"

However, it does not match the following string: "𝕾𝖍𝖎𝖗𝖔𝖓". Why not, and how can I make it do so?

Edit: Forgot to mention that it works as expected at https://regexr.com/

Upvotes: 1

Views: 135

Answers (2)

sani bani

Reputation: 83

You can write the pattern so that it matches only the characters you need, instead of the ones you don't need.

This should work: [А-я\w\d\"+\"\n\"+\" ,.?!ё/@#:()]

Upvotes: 0

Wiktor Stribiżew

Reputation: 627537

If you check the string at this link, you will see that "𝕾𝖍𝖎𝖗𝖔𝖓" consists of characters belonging to the \p{L} (letter) category. Your regex starts with [^А-я\w\d, which means it matches any char except Russian letters (the А-я range excludes ё, which you add a bit later, and Ё, which is missing), any Unicode letters and digits, and the underscore, because in Python 3 \w by default matches any Unicode alphanumeric char plus the underscore. Since every character in your string is a Unicode letter, \w covers it, so the negated class does not match.
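A minimal sketch of the behaviour described above, assuming the sample string is "Shiron" written in Mathematical Bold Fraktur letters (the escapes below are a reconstruction, not copied from the question). It compares the default Unicode-aware \w with the ASCII-only \w you get from the re.ASCII flag, which is roughly how JavaScript-based testers such as regexr treat \w.

```python
import re
import unicodedata

# Assumed reconstruction of the sample string: "Shiron" in
# Mathematical Bold Fraktur letters (U+1D57E, U+1D58D, ...).
sample = "\U0001D57E\U0001D58D\U0001D58E\U0001D597\U0001D594\U0001D593"

# Every character is a Unicode letter (general category Lu or Ll).
categories = [unicodedata.category(ch) for ch in sample]

# In Python 3, \w in a str pattern is Unicode-aware by default,
# so it matches each of these letters.
unicode_hits = re.findall(r"\w", sample)

# With re.ASCII, \w is restricted to [a-zA-Z0-9_] and matches nothing here.
ascii_hits = re.findall(r"\w", sample, flags=re.ASCII)

# The question's negated class contains \w, so it cannot match these letters.
question_pattern = re.compile(r"[^А-я\w\d\n ,.?!ё/@#:()]")
question_hits = question_pattern.findall(sample)

print(categories, len(unicode_hits), ascii_hits, question_hits)
```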

It appears you only want to keep Russian and English letters (plus digits and the listed punctuation), so use the corresponding explicit ranges:

r"[^А-ЯЁа-яёA-Za-z0-9\n ,.?!/@#:()]+"

It matches one or more chars other than:

  • А-ЯЁа-яё - Russian letters
  • A-Za-z - ASCII letters
  • 0-9 - ASCII digits
  • \n ,.?!/@#:() - newline, space, comma, dot, question and exclamation marks, slash, at sign, hash, colon and round parentheses.
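As a usage sketch, the negated class can be passed to re.sub to strip everything outside the allowed set. The helper name strip_disallowed and the sample text are hypothetical, chosen only for illustration, and the Fraktur escapes assume the question's string is "Shiron" in Mathematical Bold Fraktur.

```python
import re

# Negated character class: any run of characters NOT in the allowed set.
DISALLOWED = re.compile(r"[^А-ЯЁа-яёA-Za-z0-9\n ,.?!/@#:()]+")


def strip_disallowed(text: str) -> str:
    """Remove every run of characters outside the allowed set (hypothetical helper)."""
    return DISALLOWED.sub("", text)


# Assumed reconstruction of the question's sample string.
fraktur = "\U0001D57E\U0001D58D\U0001D58E\U0001D597\U0001D594\U0001D593"

cleaned = strip_disallowed("Привет, " + fraktur + " world! Ёж #1")
print(cleaned)
```

Because the ranges are spelled out explicitly, the result does not depend on how \w is defined, so the same pattern should behave the same way in Python and in JavaScript-based testers.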

Upvotes: 1
