chrismat

Reputation: 267

Regular expression to remove special characters while retaining a valid email format

I am using this in C#. I start with an email-like string in this format:

employee[any characters]@company[any characters].com

I want to strip non-alphanumerics from the [any characters] pieces.

For example I want this "employee1@2 r&a*d.m32@@company98 ';99..com"

to become this "employee12radm32@company9899.com"

The expression below simply strips all of the special characters, but I want to keep a single @ before company and a single . before com. So I need the expression to ignore or mask out the employee, @company, and .com pieces; I'm just not sure how to do that.

var regex = new Regex("[^0-9a-zA-Z]"); //whitelist the acceptables, remove all else.

Upvotes: 0

Views: 2865

Answers (5)

dognose

Reputation: 20899

You can use the following regex:

(?:\W)(?!company|com)

It will replace any special char, unless it is followed by company (so @company will remain) or com (so .com will remain):

employee1@2 r&a*d.m32@@company98 ';99..com

will become

employee12radm32@company9899.com

See: http://regex101.com/r/fY8jD7/2

Note that in most regex flavors you need the g modifier to replace all occurrences of such an unwanted character. In C# this is the default behavior, so you can just use a simple Regex.Replace():

https://dotnetfiddle.net/iTeZ4F
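
In case the fiddle link goes stale, here is a minimal sketch of how the pattern above could be applied with Regex.Replace(). The class and method names (EmailCleaner, Clean) are hypothetical and only for illustration; the pattern itself is the one from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailCleaner
{
    // A non-word character that is NOT immediately followed by "company" or "com".
    // The lookahead is what protects the "@" in "@company" and the "." in ".com".
    private static readonly Regex Unwanted = new Regex(@"(?:\W)(?!company|com)");

    // Hypothetical helper: removes every match of the pattern.
    // Regex.Replace replaces all matches, so no "global" flag is needed.
    public static string Clean(string input)
    {
        return Unwanted.Replace(input, string.Empty);
    }
}
```

Calling EmailCleaner.Clean on the sample input from the question should give employee12radm32@company9899.com. Since \W does not match underscores, an underscore in the input would survive this cleanup.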


Update:

Of course, the regex (?:\W)(?!com) would be enough, but it will still leave parts like #com or ~companion, since a special character followed by com is never removed. So this is still no guarantee that the input (or rather, the conversion) is 100% valid. You should consider simply throwing a validation error instead of trying to sanitize the input to match your needs.

Even if you managed to handle these cases as well: what should happen if @company or .com appears twice?
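
To illustrate the validate-instead-of-sanitize idea, here is a rough sketch. It assumes the only accepted shape is employee + alphanumerics + @company + alphanumerics + .com; that strict format, and the names EmailValidator and EnsureValid, are assumptions made for this example rather than anything given in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailValidator
{
    // Assumed strict format; \z anchors at the very end of the string
    // (unlike $, it does not allow a trailing newline).
    private static readonly Regex Expected =
        new Regex(@"^employee[0-9a-zA-Z]*@company[0-9a-zA-Z]*\.com\z");

    // Hypothetical helper: throws instead of trying to repair the input.
    public static void EnsureValid(string input)
    {
        if (!Expected.IsMatch(input))
        {
            throw new ArgumentException("Invalid email format.", nameof(input));
        }
    }
}
```

With this approach, an input containing a second @company is rejected rather than silently rewritten, because the character classes cannot match an @.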

Upvotes: 3

Jeremy K

Reputation: 543

@dognose gave a great regex solution. I'll keep my answer here as a reference but I would go with his as it's much shorter/cleaner.

var companyName = "company";
var extension = "com";
var email = "employee1@2 r&a*d.m32@@company98 ';99..com";

var tempEmail = Regex.Replace(email, @"\W+", "");

var companyIndex = tempEmail.IndexOf(companyName);
var extIndex = tempEmail.LastIndexOf(extension);

var fullEmployeeName = tempEmail.Substring(0, companyIndex);
var fullCompanyName = tempEmail.Substring(companyIndex, extIndex - companyIndex);

var validEmail = fullEmployeeName + "@" + fullCompanyName + "." + extension;

Upvotes: 0

Arian Motamedi

Reputation: 7423

What you're trying to do, though possible, is a bit complicated with a single regex pattern. You can break the problem down into smaller steps. One way is to extract the Username and Domain groups (essentially what you described as [any characters]), "fix" each group, and replace the original text with the fixed version. Something like this:

// Original input to transform.
string input = @"employee1@2 r&a*d.m32@@company98 ';99..com";

// Regular expression to find and extract "Username" and "Domain" groups, if any.
var matchGroups = Regex.Match(input, @"employee(?<UsernameGroup>(.*))@company(?<DomainGroup>(.*))\.com");

string validInput = input;

// Get the username group from the list of matches.
var usernameGroup = matchGroups.Groups["UsernameGroup"];

if (!string.IsNullOrEmpty(usernameGroup.Value))
{
    // Replace non-alphanumeric values with empty string.
    string validUsername = Regex.Replace(usernameGroup.Value, "[^a-zA-Z0-9]", string.Empty);

    // Replace the invalid instance with the valid one.
    validInput = validInput.Replace(usernameGroup.Value, validUsername);
}

// Get the domain group from the list of matches.
var domainGroup = matchGroups.Groups["DomainGroup"];

if (!string.IsNullOrEmpty(domainGroup.Value))
{
    // Replace non-alphanumeric values with empty string.
    string validDomain = Regex.Replace(domainGroup.Value, "[^a-zA-Z0-9]", string.Empty);

    // Replace the invalid instance with the valid one.
    validInput = validInput.Replace(domainGroup.Value, validDomain);
}

Console.WriteLine(validInput);

will output employee12radm32@company9899.com.

Upvotes: 0

Erik Philips

Reputation: 54638

I would probably write something like this (ignoring case sensitivity; if you need case sensitivity, please comment):

DotNetFiddle Example

using System;
using System.Linq;

public class Program
{
    public static void Main()
    {
        var email = "employee1@2 r&a*d.m32@@company98 ';99..com";

        var result = GetValidEmail(email);

        Console.WriteLine(result);
    }


    public static string GetValidEmail(string email)
    {
      var result = email.ToLower();

      // Does it contain everything we need?
      if (result.StartsWith("employee")
          && result.EndsWith(".com")
          && result.Contains("@company"))
      {
        // Remove the leading "employee" (8 chars) and trailing ".com" (4 chars).
        result = result.Substring(8, result.Length - 12);
        // remove @company
        var split = result.Split(new string[] { "@company" },
          StringSplitOptions.RemoveEmptyEntries);

        // validate we have exactly two parts (you may not need this)
        if (split.Length != 2)
        {
          throw new ArgumentException("Invalid Email.");
        }

        // recreate valid email
        result = "employee"
          + new string (split[0].Where(c => char.IsLetterOrDigit(c)).ToArray())
          + "@company"
          + new string (split[1].Where(c => char.IsLetterOrDigit(c)).ToArray())
          + ".com";

      }
      else
      {
        throw new ArgumentException("Invalid Email.");
      }

      return result;
    }
}

Result

employee12radm32@company9899.com

Upvotes: 0

TOP KEK

Reputation: 2651

You can simplify your regex and replace it with

tmp = Regex.Replace(n, @"\W+", "");

where \w matches letters, digits, and underscores, and \W is the negated version of \w. In general, it is better to whitelist the allowed characters than to try to predict every disallowed symbol.
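
One practical difference is worth seeing side by side: because \w includes the underscore, \W+ is not an exact substitute for the [^0-9a-zA-Z] whitelist from the question. The sketch below compares the two; the class and method names are hypothetical and only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordClassDemo
{
    // \W is the complement of \w, so underscores are kept.
    public static string StripNonWord(string s)
    {
        return Regex.Replace(s, @"\W+", "");
    }

    // Explicit ASCII whitelist, as in the question: underscores are removed too.
    public static string StripNonAlnum(string s)
    {
        return Regex.Replace(s, "[^0-9a-zA-Z]+", "");
    }
}
```

For the input a_b-c.d, StripNonWord returns a_bcd while StripNonAlnum returns abcd. By default, .NET's \w also matches non-ASCII letters and digits, so the explicit character class is the stricter whitelist of the two.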

Upvotes: 0
