matti157

Reputation: 1288

Replace all characters in a string that are outside the Windows-1252 set

I maintain old programs written in VB6 and have run into the following issue.

I need an efficient way to find all characters in a string that are OUTSIDE the Windows-1252 set and replace them with "_". I can do this in C#.

So far I have done this by building a string that contains all Windows-1252 characters. Is there a faster way?

I may have to do this for a few million records in a text file.

string chars1252 = "!\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿŸžœ›š™˜—–•”“’‘ŽŒ‹Š‰ˆ‡†…„ƒ‚€ ";

//Replace all characters not in the string above...

Upvotes: 0

Views: 467

Answers (2)

Codo

Reputation: 78795

The Encoding class can achieve this, most likely very efficiently. When converting to and from the encoding, a replacement character can be specified.

using System;
using System.Text;
                    
public class Program
{
    public static void Main()
    {
        // On .NET Core / .NET 5+ only: register the code pages provider so that code page 1252 is available.
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        var text = "abc絵de😂fgh";
        text = Win1252Safe(text);
        Console.WriteLine(text);
    }
    
    private static Encoding Win1252R = Encoding.GetEncoding(1252,
                                  new EncoderReplacementFallback("_"),
                                  new DecoderReplacementFallback("_"));
    
    public static string Win1252Safe(string text) {
        var bytes = Win1252R.GetBytes(text);
        return Win1252R.GetString(bytes);
    }
}

Output

abc_de__fgh
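
Since the question mentions a few million records in a text file, the same replacement fallback could also be applied while streaming the file, so that no per-line round trip is needed. The following is only a minimal sketch under assumptions that are not in the question: the input file is taken to be UTF-8, the output is written as Windows-1252, and the class name Win1252FileCleaner, the method CleanFile and the command-line paths are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

public static class Win1252FileCleaner
{
    // Copies inputPath to outputPath, writing Windows-1252 bytes.
    // Assumption: the input file is UTF-8 encoded.
    public static void CleanFile(string inputPath, string outputPath)
    {
        // Needed on .NET Core / .NET 5+; on .NET Framework this call can be dropped.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Characters that cannot be encoded in Windows-1252 are written as "_".
        var win1252 = Encoding.GetEncoding(1252,
            new EncoderReplacementFallback("_"),
            new DecoderReplacementFallback("_"));

        using (var writer = new StreamWriter(outputPath, false, win1252))
        {
            // File.ReadLines streams the file lazily, one line at a time.
            foreach (string line in File.ReadLines(inputPath, Encoding.UTF8))
            {
                writer.WriteLine(line);
            }
        }
    }

    public static void Main(string[] args)
    {
        // Hypothetical usage: Win1252FileCleaner <input.txt> <output.txt>
        CleanFile(args[0], args[1]);
    }
}
```

Letting the StreamWriter do the encoding means the replacement happens once, at write time. If the cleaned text has to stay a .NET string instead, the Win1252Safe method above can be called per line.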

Upvotes: 1

Sebastian Siemens

Reputation: 2421

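Have you tried normalizing the string? Note that string.Normalize() does not by itself remove characters that are outside the Windows-1252 character set; it only converts the string to a Unicode normalization form. FormD, for example, decomposes an accented letter into its base letter plus combining marks, so normalization can at most serve as a preprocessing step before the actual replacement. https://learn.microsoft.com/de-de/dotnet/api/system.string.normalize?view=net-7.0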

string inputString = "Some input string";
string outputString = inputString.Normalize(NormalizationForm.FormD);

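Alternatively, you can loop over the characters of the string and use the StringBuilder class to replace those that are not in the set. Note that the check below only approximates Windows-1252: it keeps everything up to U+00FF (Latin-1, including the C1 control range U+0080 to U+009F) and replaces Windows-1252 characters that lie above U+00FF, such as '€' (U+20AC).

string inputString = "Some input string";
StringBuilder sb = new StringBuilder(inputString.Length);
foreach (char c in inputString)
{
    // Keep characters up to U+00FF, replace everything else with '_'
    // (the question asks for replacement, not removal).
    sb.Append(c <= '\u00FF' ? c : '_');
}
string outputString = sb.ToString();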
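
A range check such as c <= '\u00FF' does not match Windows-1252 exactly. One way to keep the loop while testing real membership might be to build a lookup table from the encoding itself, by decoding all 256 byte values once. This is a rough sketch only; the class name Win1252Filter and the method Replace are hypothetical, and the table marks as allowed whatever the runtime decodes each byte to, which includes ASCII control characters and the runtime's mapping for the byte values that Windows-1252 leaves undefined.

```csharp
using System;
using System.Text;

public static class Win1252Filter
{
    // Indexed by UTF-16 code unit: true if the runtime's code page 1252 produces that char.
    private static readonly bool[] Allowed = BuildTable();

    private static bool[] BuildTable()
    {
        // Needed on .NET Core / .NET 5+; on .NET Framework this call can be dropped.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding enc = Encoding.GetEncoding(1252);

        // Decode every possible byte value once to learn which chars the code page covers.
        var allBytes = new byte[256];
        for (int i = 0; i < 256; i++)
        {
            allBytes[i] = (byte)i;
        }

        var table = new bool[char.MaxValue + 1];
        foreach (char c in enc.GetString(allBytes))
        {
            table[c] = true;
        }
        return table;
    }

    // Replaces every UTF-16 code unit that is not in the table with the replacement char.
    public static string Replace(string text, char replacement = '_')
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            sb.Append(Allowed[c] ? c : replacement);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // '€' is part of Windows-1252 and is kept; the emoji is a surrogate pair,
        // so both of its code units are replaced.
        Console.WriteLine(Replace("abc絵de😂fgh€")); // abc_de__fgh€
    }
}
```

The table costs 64 KB once and makes each check a single array lookup, which should matter more than the setup cost when millions of records are processed.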

Upvotes: 1
