Reputation: 267
I am using this in C#. I start with an email-like string in this format:
employee[any characters]@company[any characters].com
I want to strip non-alphanumerics from the [any characters] pieces.
For example I want this "employee1@2 r&a*d.m32@@company98 ';99..com"
to become this "employee12radm32@company9899.com"
This expression simply takes all of the specials away, but I want to leave a single @ before company and a single . before com. So I need the expression to ignore or mask out the employee, @company, and .com pieces... just not sure how to do that.
var regex = new Regex("[^0-9a-zA-Z]"); //whitelist the acceptables, remove all else.
Upvotes: 0
Views: 2865
Reputation: 20899
You can use the following regex:
(?:\W)(?!company|com)
It matches any special character unless that character is followed by company (so @company will remain) or com (so .com will remain). Replacing every match with an empty string means that:
employee1@2 r&a*d.m32@@company98 ';99..com
will become
employee12radm32@company9899.com
See: http://regex101.com/r/fY8jD7/2
Note that on regex101 you need the g modifier to replace all occurrences of such an unwanted character. In C# this is the default, so you can simply use Regex.Replace():
https://dotnetfiddle.net/iTeZ4F
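For readers who cannot open the fiddle, here is a minimal sketch of the same replacement. The class and method names (EmailSanitizer, Sanitize) are made up for illustration; only the pattern and the use of Regex.Replace come from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailSanitizer
{
    // Hypothetical helper: removes every non-word character unless it is
    // immediately followed by "company" or "com" (negative lookahead).
    public static string Sanitize(string input)
    {
        return Regex.Replace(input, @"(?:\W)(?!company|com)", string.Empty);
    }
}
```

Calling EmailSanitizer.Sanitize on the sample input from the question should give employee12radm32@company9899.com. Keep in mind that \W does not match underscores, so those would be left in place.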
Update:
Of course, the regex (?:\W)(?!com) would be enough, but either version will still leave parts like #com or ~companion untouched, since the lookahead protects any special character that is followed by com. So this is still not a guarantee that the input (or, let's say, the conversion) is 100% valid. You should consider simply throwing a validation error instead of trying to sanitize the input to match your needs.
Even if you managed to handle these cases as well, what should happen if @company or .com appears twice?
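If you go the validation route, a rough sketch could look like the following. The expected shape (employee, alphanumerics, @company, alphanumerics, .com) is my reading of the question, and the names EmailValidator and EnsureValid are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailValidator
{
    // Assumed valid shape: employee<alnum*>@company<alnum*>.com
    // \z anchors at the very end of the string (no trailing newline allowed).
    private static readonly Regex Expected =
        new Regex(@"^employee[0-9a-zA-Z]*@company[0-9a-zA-Z]*\.com\z");

    // Throws instead of trying to repair the input.
    public static void EnsureValid(string input)
    {
        if (input == null || !Expected.IsMatch(input))
        {
            throw new ArgumentException("Invalid email: " + input);
        }
    }
}
```

With this approach a second @company or .com is rejected outright, because @ and . are not in the allowed character class, so there is nothing ambiguous left to resolve.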
Upvotes: 3
Reputation: 543
@dognose gave a great regex solution. I'll keep my answer here as a reference but I would go with his as it's much shorter/cleaner.
var companyName = "company";
var extension = "com";
var email = "employee1@2 r&a*d.m32@@company98 ';99..com";
var tempEmail = Regex.Replace(email, @"\W+", "");
var companyIndex = tempEmail.IndexOf(companyName);
var extIndex = tempEmail.LastIndexOf(extension);
var fullEmployeeName = tempEmail.Substring(0, companyIndex);
var fullCompanyName = tempEmail.Substring(companyIndex, extIndex - companyIndex);
var validEmail = fullEmployeeName + "@" + fullCompanyName + "." + extension;
Upvotes: 0
Reputation: 7423
What you're trying to do is, though possible, a little bit complicated using one single regex pattern. You can break this scenario down into smaller steps. One way of doing it is to extract the Username and Domain groups (essentially what you described as [any characters]), "fix" each group, and replace the original with it. Something like this:
// Original input to transform.
string input = @"employee1@2 r&a*d.m32@@company98 ';99..com";
// Regular expression to find and extract "Username" and "Domain" groups, if any.
// The dot before "com" is escaped so that it only matches a literal '.'.
var matchGroups = Regex.Match(input, @"employee(?<UsernameGroup>(.*))@company(?<DomainGroup>(.*))\.com");
string validInput = input;
// Get the username group from the list of matches.
var usernameGroup = matchGroups.Groups["UsernameGroup"];
if (!string.IsNullOrEmpty(usernameGroup.Value))
{
// Replace non-alphanumeric values with empty string.
string validUsername = Regex.Replace(usernameGroup.Value, "[^a-zA-Z0-9]", string.Empty);
// Replace the invalid instance with the valid one.
validInput = validInput.Replace(usernameGroup.Value, validUsername);
}
// Get the domain group from the list of matches.
var domainGroup = matchGroups.Groups["DomainGroup"];
if (!string.IsNullOrEmpty(domainGroup.Value))
{
// Replace non-alphanumeric values with empty string.
string validDomain = Regex.Replace(domainGroup.Value, "[^a-zA-Z0-9]", string.Empty);
// Replace the invalid instance with the valid one.
validInput = validInput.Replace(domainGroup.Value, validDomain);
}
Console.WriteLine(validInput);
will output employee12radm32@company9899.com.
Upvotes: 0
Reputation: 54638
I would probably write something like this (ignoring case sensitivity; if you need case sensitivity, please comment):
using System;
using System.Linq;
public class Program
{
public static void Main()
{
var email = "employee1@2 r&a*d.m32@@company98 ';99..com";
var result = GetValidEmail(email);
Console.WriteLine(result);
}
public static string GetValidEmail(string email)
{
var result = email.ToLower();
// Does it contain everything we need?
if (result.StartsWith("employee")
&& result.EndsWith(".com")
&& result.Contains("@company"))
{
// remove "employee" (8 chars) from the start and ".com" (4 chars) from the end.
result = result.Substring(8, result.Length - 12);
// remove @company
var split = result.Split(new string[] { "@company" },
StringSplitOptions.RemoveEmptyEntries);
// validate we have exactly two parts (you may not need this)
if (split.Length != 2)
{
throw new ArgumentException("Invalid Email.");
}
// recreate valid email
result = "employee"
+ new string (split[0].Where(c => char.IsLetterOrDigit(c)).ToArray())
+ "@company"
+ new string (split[1].Where(c => char.IsLetterOrDigit(c)).ToArray())
+ ".com";
}
else
{
throw new ArgumentException("Invalid Email.");
}
return result;
}
}
Result: employee12radm32@company9899.com
Upvotes: 0
Reputation: 2651
You can simplify your regex and replace it by
tmp = Regex.Replace(n, @"\W+", "");
where \w means all letters, digits, and underscores, and \W is the negated version of \w. Note that, unlike [^0-9a-zA-Z], this keeps underscores.
In general it is better to create a whitelist of allowed characters instead of trying to predict every disallowed symbol.
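To make the difference between the two whitelists concrete, here is a small sketch. The class and method names are made up for illustration; both patterns are the ones discussed in this thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitelistDemo
{
    // \W is the complement of \w, so letters, digits, and underscores survive.
    // In .NET, \w is Unicode-aware by default, so non-ASCII letters survive too.
    public static string StripNonWord(string s) => Regex.Replace(s, @"\W+", "");

    // Explicit ASCII whitelist: only 0-9, a-z, and A-Z survive.
    public static string StripNonAlnum(string s) => Regex.Replace(s, "[^0-9a-zA-Z]+", "");
}
```

For example, StripNonWord("a_b-c") keeps the underscore and returns a_bc, while StripNonAlnum("a_b-c") returns abc. If the address must be strictly ASCII alphanumeric, the explicit character class is the safer choice.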
Upvotes: 0