Custodio

Reputation: 8934

Remove all exclusive Latin characters using regex

I'm developing software in Portuguese, so many of my entities have names like 'maça' or 'lição', and I want to use the entity name as a resource key. So I want to keep every character except 'ç', 'ã', 'õ', and so on.

Is there an optimal solution using regex? My current regex (as suggested in "Remove characters using Regex") is:

Regex regex = new Regex(@"[\W_]+");
string cleanText = regex.Replace(messyText, "").ToUpper();

Just to emphasize: I'm only concerned with Latin characters.

Upvotes: 10

Views: 11956

Answers (6)

Kobi

Reputation: 138007

A simple option is to white-list the accepted characters:

string clean = Regex.Replace(messy, @"[^a-zA-Z0-9!@#]+", "");

If you want to remove all non-ASCII letters but keep all other characters, you can use character class subtraction:

string clean = Regex.Replace(messy, @"[\p{L}-[a-zA-Z]]+", "");

It can also be written as the more standard but more complicated [^\P{L}a-zA-Z]+, which reads "match every character that is neither a non-letter nor an ASCII letter", which ends up with exactly the letters we're looking for.
Some context on \W: it stands for "not a word character". In .NET, \w is Unicode-aware by default, so it already accepts letters such as ç and ã, and \W will not remove them. Only with RegexOptions.ECMAScript does \w shrink to a-z, A-Z, 0-9 and the underscore _.
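
As a rough illustration of the two points above, here is a minimal sketch. The class and method names are hypothetical, and the inputs are assumed to be in precomposed (NFC) form, where 'ç' and 'ã' are single code points:

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class NonAsciiLetterStripper
{
    // [\p{L}-[a-zA-Z]] = any Unicode letter, minus the ASCII letters.
    private static readonly Regex NonAsciiLetters =
        new Regex(@"[\p{L}-[a-zA-Z]]+", RegexOptions.CultureInvariant);

    // Removes non-ASCII letters; digits, underscores and punctuation stay.
    public static string Strip(string input)
    {
        return NonAsciiLetters.Replace(input, "");
    }

    // The pattern from the question, but with \w restricted to ASCII
    // by RegexOptions.ECMAScript, so \W also matches accented letters.
    public static string StripNonWordEcma(string input)
    {
        return Regex.Replace(input, @"[\W_]+", "", RegexOptions.ECMAScript);
    }
}
```

With these definitions, Strip("lição_1!") should give "lio_1!", while StripNonWordEcma("lição_1!") should give "lio1", because the second pattern also removes the underscore and the punctuation.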

You may also consider the following approach more useful: How do I remove diacritics (accents) from a string in .NET?
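
For reference, the diacritics approach is usually sketched along these lines: decompose the string, drop the combining marks, and recompose. This is only an outline of that commonly used technique, with hypothetical names, and not necessarily the exact code from the linked question:

```csharp
using System.Globalization;
using System.Text;

// Hypothetical helper, for illustration only.
public static class DiacriticRemover
{
    public static string RemoveDiacritics(string input)
    {
        // FormD splits 'ç' into 'c' plus a combining cedilla, and so on.
        string decomposed = input.Normalize(NormalizationForm.FormD);
        var builder = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            // Skip the combining marks, keep the base characters.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                builder.Append(c);
            }
        }

        return builder.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

Under this sketch, "lição" would become "licao" rather than "lio", which may make for more readable resource keys.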

Upvotes: 7

Tergiver

Reputation: 14507

Another option might be to convert from Unicode to ASCII. This will not drop characters, but convert them to ?s. That might be better than dropping them (for use as keys).

string suspect = "lição";
byte[] suspectBytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, Encoding.Unicode.GetBytes(suspect));
string purged = Encoding.ASCII.GetString(suspectBytes);
Console.WriteLine(purged); // li??o

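Note that distinct unrepresentable characters all map to the same ?, so collisions are still possible. Because each ? keeps its position in the string, though, you may get fewer collisions than by dropping the characters outright.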
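
If ? is awkward in a resource key, the replacement can be changed. The following is a minimal sketch, assuming the built-in "us-ascii" encoding with a replacement fallback; the class and method names are hypothetical:

```csharp
using System.Text;

// Hypothetical helper, for illustration only.
public static class AsciiPurger
{
    // Replaces every character that ASCII cannot represent
    // with the given replacement string (for example "?" or "_").
    public static string Purge(string input, string replacement)
    {
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(replacement),
            new DecoderReplacementFallback(replacement));

        return ascii.GetString(ascii.GetBytes(input));
    }
}
```

Here Purge("lição", "?") should reproduce the "li??o" result above, and Purge("lição", "_") should give "li__o".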

Upvotes: 5

Marcelo Rodovalho

Reputation: 923

This is more useful to me:

([\p{L}]+)

Upvotes: 0

Auri Rahimzadeh

Reputation: 2263

The goal should be to simply include ASCII characters A-Z and numbers and punctuation. Just exclude everything outside of that range using RegEx.

string clean = Regex.Replace(messy, @"[^\x20-\x7e]", String.Empty);

To be clear, the regex I'm using is:

[^\x20-\x7e]

You may need to escape the \ character - I haven't tested this in anything but RegEx buddy :)

That excludes everything outside the ASCII range 0x20 to 0x7e, which translates to decimal 32-126 (the printable ASCII characters).
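
One side effect worth seeing in a small sketch (hypothetical names): since the range starts at 0x20, control characters such as tabs and newlines are removed along with the accented letters.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class PrintableAsciiFilter
{
    // Keeps only the printable ASCII range, space (0x20) through tilde (0x7e).
    public static string Clean(string messy)
    {
        return Regex.Replace(messy, @"[^\x20-\x7e]", String.Empty);
    }
}
```

For example, Clean("lição\t1~") should return "lio1~": the 'ç', the 'ã' and the tab are all dropped, while the tilde at 0x7e is kept.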

Good luck!

Best,

-Auri

Upvotes: 2

Ezra

Reputation: 7702

I think the best regex would be to use:

[^\x00-\x7F]

This is the negation of all ASCII characters, so it matches all non-ASCII characters. \x00 and \x7F (127) are hexadecimal character codes, and - means range. The ^ inside the [ and ] means negation.

Replace the matches with the empty string, and you should have what you want. It also frees you from worrying about non-ASCII punctuation and the like, which can cause subtle but annoying (and hard to track down) errors.

If you want to use the extended ASCII set as legal characters, you can say \xFF instead of \x7F.
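
A minimal sketch of both variants follows, assuming the ASCII range is written as \x00-\x7F. The names are hypothetical, and in .NET the \xFF variant effectively keeps the Latin-1 code points U+0080 to U+00FF:

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class NonAsciiRemover
{
    // Removes everything outside ASCII (0x00-0x7F); control characters stay.
    public static string KeepAscii(string messy)
    {
        return Regex.Replace(messy, @"[^\x00-\x7F]", "");
    }

    // Removes everything above 0xFF, so 'ç' and 'ã' survive.
    public static string KeepUpToFF(string messy)
    {
        return Regex.Replace(messy, @"[^\x00-\xFF]", "");
    }
}
```

Given a string with curly quotes around "lição" followed by a tab and "1", KeepAscii should return "lio", the tab and "1", while KeepUpToFF should keep "lição" intact and drop only the curly quotes.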

Upvotes: 2

Chris Haas

Reputation: 55417

Does this work?

Regex regex = new Regex(@"[^a-zA-Z0-9_]");

Upvotes: 4
