Reputation: 375
I have been working on creating a Product Feed for a third-party company. The data I am working with has all sorts of invalid and special characters, double spacing, etc. They have also requested that the data be HTML encoded wherever special characters are used.
An example of some data that would be passed = "Buy Kitchen
Aid Artisan™ Stand Mixer 4.8L "
try
{
    var removeDoubleSpace = Regex.Replace(stringInput, @"\s+", " ");
    var encodedString = HttpUtility.HtmlEncode(removeDoubleSpace).Trim();
    var encodedAndLineBreaksRemoved = encodedString.Replace(Environment.NewLine, "");
    var finalStringOutput = Regex.Replace(encodedAndLineBreaksRemoved, @"(™)|(’)|(”)|(–)", "");
    return finalStringOutput;
}
catch (Exception)
{
    return stringInput;
}
I was trying to come up with one method that could be called to do all of the above in a cleaner way, rather than several Regex expressions. Or is there perhaps a single regex that covers everything?
Upvotes: 3
Views: 1713
Reputation: 11228
You don't need regex; LINQ will do as well:
var str = "Buy Kitchen Aid Artisan™ Stand Mixer 4.8L";
var newStr = new string(str.Where(c => !Char.IsSymbol(c)).ToArray());
Console.WriteLine(newStr); // Buy Kitchen Aid Artisan Stand Mixer 4.8L
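To make the behaviour of this filter easier to see, here is a minimal sketch that wraps the same LINQ expression in a helper. The names `SymbolFilter` and `StripSymbols` are made up for illustration and are not from the post above. Note that `char.IsSymbol` is true only for the Unicode symbol categories (math, currency, modifier and other symbols), so it removes `™` but also `+` or `$`, while punctuation such as `’`, `”` and `–` is left in place.

```csharp
using System;
using System.Linq;

public static class SymbolFilter
{
    // Hypothetical helper (name is an assumption): drops every character
    // for which char.IsSymbol returns true and keeps everything else.
    public static string StripSymbols(string input)
    {
        return new string(input.Where(c => !char.IsSymbol(c)).ToArray());
    }
}
```

For example, `SymbolFilter.StripSymbols("A\u2122 + B\u2019s")` drops the trademark sign and the plus sign but keeps the curly apostrophe, so this approach alone may not cover every character listed in the question.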
Upvotes: 0
Reputation: 141442
Use a white list, not a blacklist: it is easier to know which characters are acceptable than to predict every unacceptable character that might turn up. A white list is simply a list of acceptable characters. Create your white list, and remove everything that is not on it. In your case, a potential white list could include all ASCII characters.
The following white list captures ASCII letters and digits plus punctuation characters. Note that \p{P} covers all Unicode punctuation, so characters such as ’, ” and – are kept by this pattern.
using System;
using System.Text;
using System.Text.RegularExpressions;

public class Program
{
    private static string input = @"Buy Kitchen
Aid Artisan™ Stand Mixer 4.8L ";

    public static void Main()
    {
        // Each match is a run of white-listed characters.
        var match = Regex.Match(input, @"[a-zA-Z0-9\p{P}]+");
        StringBuilder builder = new StringBuilder();
        while (match.Success)
        {
            // add a space between matches
            builder.Append(match.Value + " ");
            match = match.NextMatch();
        }
        // Trim the trailing space added after the last match.
        Console.WriteLine(builder.ToString().Trim());
    }
}
Output
Buy Kitchen Aid Artisan Stand Mixer 4.8L
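The "all ASCII characters" variant mentioned above can be sketched with a negated character class instead of a match loop. This is only an illustration under stated assumptions: the class and method names (`AsciiWhitelist`, `Clean`) are hypothetical, and the white list is taken to be printable ASCII (U+0020 to U+007E) plus whitespace, which is then collapsed to single spaces.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AsciiWhitelist
{
    // Hypothetical helper (name and exact white list are assumptions):
    // keeps printable ASCII only and normalises whitespace.
    public static string Clean(string input)
    {
        // Remove everything that is neither printable ASCII nor whitespace.
        var whitelisted = Regex.Replace(input, @"[^\x20-\x7E\s]", string.Empty);
        // Collapse each run of whitespace (including line breaks) into one space.
        var collapsed = Regex.Replace(whitelisted, @"\s+", " ");
        return collapsed.Trim();
    }
}
```

Removing the unwanted characters before collapsing whitespace avoids leaving a double space where a removed character sat between two spaces. Unlike the \p{P} pattern, this version also drops non-ASCII punctuation such as ’, ” and –.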
Upvotes: 2
Reputation: 626689
Here is a slightly improved version of your code:
var removeDoubleSpace = Regex.Replace(stringInput, @"\s+", " ");
var encodedString = System.Web.HttpUtility.HtmlEncode(removeDoubleSpace)
    .Trim()
    .Replace("™", string.Empty)
    .Replace("’", string.Empty)
    .Replace("”", string.Empty)
    .Replace("–", string.Empty);
You do not need var encodedAndLineBreaksRemoved = encodedString.Replace(Environment.NewLine, ""); because the newline characters have already been removed by the \s+ regex (\s matches any whitespace character, including space, tab, form feed, and line breaks; it covers at least [ \f\n\r\t\v]).
Also, there is no need for a second regex unless you plan to remove a range of characters or a whole class (e.g. all characters in the \p{S} shorthand class), so I just chained several string.Replace calls directly onto the trimmed and encoded string.
Output:
Buy Kitchen Aid Artisan Stand Mixer 4.8L
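If a single callable method is the goal, the steps above could be wrapped roughly as follows. This is a sketch, not a drop-in replacement: `FeedText.Clean` is a hypothetical name, the four characters are removed with one character class before encoding (rather than with chained Replace calls after it), and `System.Net.WebUtility.HtmlEncode` is used instead of `HttpUtility.HtmlEncode` so the example does not need a System.Web reference.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class FeedText
{
    // Hypothetical helper (name is an assumption, not from the question).
    public static string Clean(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return input;
        }

        // Remove the unwanted characters from the question:
        // U+2122 (trademark), U+2019 and U+201D (curly quotes), U+2013 (en dash).
        var stripped = Regex.Replace(input, "[\u2122\u2019\u201D\u2013]", string.Empty);

        // Collapse whitespace runs (including line breaks) and trim the ends.
        var collapsed = Regex.Replace(stripped, @"\s+", " ").Trim();

        // HTML-encode what is left, e.g. & becomes &amp; and < becomes &lt;.
        return WebUtility.HtmlEncode(collapsed);
    }
}
```

Calling `FeedText.Clean` on the sample string from the question should give the same output as shown above, and markup characters that remain in the text are encoded rather than removed.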
Upvotes: 0