Reputation: 1288
I maintain old programs written in VB6 and keep running into this issue.
I need an efficient way to search a string for all characters outside the Windows-1252 set and replace them with "_". I can do this in C#.
So far I have done this by building a string that contains all 1252 characters. Is there a faster way?
I may have to do this for a few million records in a text file.
string chars1252 = "!\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿŸžœ›š™˜—–•”“’‘ŽŒ‹Š‰ˆ‡†…„ƒ‚€ ";
//Replace all characters not in the string above...
Upvotes: 0
Views: 467
Reputation: 78795
The Encoding class can achieve this, most likely very efficiently. When converting to and from the encoding, a replacement character can be specified.
using System;
using System.Text;

public class Program
{
    public static void Main()
    {
        // For .NET Core only:
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        var text = "abc絵de😂fgh";
        text = Win1252Safe(text);
        Console.WriteLine(text);
    }

    // Windows-1252 encoding that substitutes "_" for anything it cannot represent.
    private static Encoding Win1252R = Encoding.GetEncoding(1252,
        new EncoderReplacementFallback("_"),
        new DecoderReplacementFallback("_"));

    // Round-trips the text through Windows-1252 so unsupported characters become "_".
    public static string Win1252Safe(string text)
    {
        var bytes = Win1252R.GetBytes(text);
        return Win1252R.GetString(bytes);
    }
}
Output
abc_de__fgh
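For the "few million records in a text file" case from the question, the same kind of encoding can be handed to a StreamWriter so that the replacement happens while the file is written, one line at a time. This is only a minimal sketch: the class name `Win1252File`, the helper `SanitizeFile`, and the assumption that the input file is UTF-8 are all hypothetical and not part of the original answer.

```csharp
using System;
using System.IO;
using System.Text;

public static class Win1252File
{
    // Hypothetical helper: streams inputPath (assumed to be UTF-8) to outputPath
    // as Windows-1252, writing "_" for every character that cannot be encoded.
    public static void SanitizeFile(string inputPath, string outputPath)
    {
        // For .NET Core only, register the code page provider before this call:
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding win1252R = Encoding.GetEncoding(1252,
            new EncoderReplacementFallback("_"),
            new DecoderReplacementFallback("_"));

        using (var writer = new StreamWriter(outputPath, false, win1252R))
        {
            // File.ReadLines is lazy, so the whole file is never held in memory.
            foreach (var line in File.ReadLines(inputPath, Encoding.UTF8))
            {
                writer.WriteLine(line);
            }
        }
    }
}
```

A call would look like Win1252File.SanitizeFile("input.txt", "output.txt"), with both paths being placeholders. Writing through the encoder avoids building an intermediate byte array and string for every record, though whether that is measurably faster than the round-trip above would need to be checked on real data.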
Upvotes: 1
Reputation: 2421
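Have you tried normalizing the string? Note that string.Normalize() does not remove characters on its own: it converts the string to a Unicode normalization form, and FormD decomposes accented letters into a base letter plus combining marks. Any characters outside Windows-1252 still have to be filtered afterwards. https://learn.microsoft.com/de-de/dotnet/api/system.string.normalize?view=net-7.0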
string inputString = "Some input string";
string outputString = inputString.Normalize(NormalizationForm.FormD);
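Alternatively, you can loop over the string and use the StringBuilder class to replace each character that is not in the Windows-1252 set. Windows-1252 is not simply U+0000 to U+00FF: bytes 0x80 to 0x9F map to characters such as € and ™ that sit higher up in Unicode, so those are listed explicitly.
string inputString = "Some input string";
// Characters that Windows-1252 assigns to bytes 0x80-0x9F.
const string extra1252 = "€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ";
StringBuilder sb = new StringBuilder(inputString.Length);
foreach (char c in inputString)
{
    // ASCII, Latin-1 from U+00A0 to U+00FF, or one of the extra characters above.
    // The C1 controls U+0080 to U+009F are treated as outside the set here.
    bool inSet = c < '\u0080'
        || (c >= '\u00A0' && c <= '\u00FF')
        || extra1252.IndexOf(c) >= 0;
    sb.Append(inSet ? c : '_');
}
string outputString = sb.ToString();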
Upvotes: 1