waldyr.ar

Reputation: 15204

Characters from a different encoding converted to UTF-8 are not matched by regex

I recently discovered some flaws in my user records. Some of the registered emails contain characters in encodings other than UTF-8, so I'm trying to clean all those emails with gsub. For now I'm trying to capture all the flawed records using this regex. Explanation of the regex: http://regexr.com/3bati

/\A[^@\s]+@([^@\s]+\.)+[^@\W]+\z/

But I'm not able to capture the following string, which I inserted into the database as a flag:

"\[email protected]".encode('utf-8')

How can I improve this regex to tighten my validation and keep encodings from ruining my login?

Upvotes: 1

Views: 154

Answers (1)

Aleksei Matiushkin

Reputation: 121000

As I understand your task, you want to make sure that the email entered by the user is what she wanted to enter. I would go with:

"\[email protected]".gsub(/[^\p{ASCII}]/, '').encode('ISO-8859-1')

First of all, you don’t need to ensure it’s a valid email; that is a different task. Second, all non-ASCII characters should be filtered out. That’s likely all there is to it.

Of course, you might apply any further email validation check.

NB: the #encode call at the end is there to ensure that a valid ISO-8859-1 string is left after sanitization.
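To tie the two steps together, here is a minimal sketch of how the sanitization could be combined with the regex from the question. The helper name strip_non_ascii and the sample address are made up for illustration, and the String#scrub call is an extra assumption on my part: it drops invalid byte sequences first, since gsub raises on a string whose bytes are not valid in its encoding. Note that [^@\s] in the question's regex accepts any character that is not "@" or whitespace, non-ASCII ones included, which is why stripping has to happen as a separate step.

```ruby
# Regex copied from the question; it checks shape only, not the character set.
EMAIL_FORMAT = /\A[^@\s]+@([^@\s]+\.)+[^@\W]+\z/

# Hypothetical helper: drop invalid bytes, then every non-ASCII character.
def strip_non_ascii(email)
  email.scrub('').gsub(/[^\p{ASCII}]/, '')
end

raw     = "\u00E9user@example.com"  # made-up sample with one non-ASCII character
cleaned = strip_non_ascii(raw)

puts raw.ascii_only?                # false: the record needs cleaning
puts cleaned                        # user@example.com
puts EMAIL_FORMAT.match?(cleaned)   # true: the cleaned value still looks like an email
```

String#ascii_only? is also a cheap way to flag the affected records before rewriting them.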

Upvotes: 1
