Shaggydog

Reputation: 3788

Regex for catching word with special characters between letters

I am new to regex and am programming an advanced profanity filter for a commenting feature (in C#). To save time: I know that all filters can be fooled, no matter how good they are, so you don't have to tell me that. I'm just trying to make it a bit more advanced than basic word replacement. I've split the task into several separate approaches, and this is one of them.

What I need is a specific piece of regex that catches strings such as these:

s_h_i_t
s h i t
S<>H<>I<>T
s_/h_/i_/t
s***h***i***t

You get the idea. I guess what I'm looking for is a regex that says "one or more characters that are not alphanumeric" between each pair of letters. This should include both spaces and all special characters that you can type on a standard (western) keyboard. If possible, it should also include line breaks, so it would catch things like

s
h
i
t

There should always be at least one of the characters present, to avoid likely false positives such as in

Finish it.

This will of course mean that things like

sh_it

will not be caught, but as I said, it doesn't matter; it doesn't have to be perfect. All I need is the regex; I can do the splitting of words and inserting the regex myself. I have the RegexOptions.IgnoreCase option set in my C# code, so character case in the actual word is not an issue. Also, this regex shouldn't worry about "leetspeak", i.e. some of the actual letters of the word being replaced by other characters:

sh1t

I have a different approach that deals with that. Thank you in advance for your help.

Upvotes: 1

Views: 1914

Answers (4)

Arie

Reputation: 5373

\bs[\W_]*h[\W_]*i[\W_]*t[\W_]*(?!\w)

  • [\W_]* matches zero or more characters between the letters that are either non-word characters or _; this covers whitespace and line breaks as well

  • \b (word boundary) ensures that Finish it won't match

  • (?!\w) ensures that sh ituuu won't match. You may want to remove or modify it, because s_hittt will not match either. \bs[\W_]*h[\W_]*i[\W_]*t+[\W_]*(?!\w) will match the word with a repeated last character

  • the modification \bs[\W_]*h[\W_]*i[\W_]*t[\W_]*?(?!\w) makes the last character class non-greedy, so in sh it&&& only sh it will match

  • \bs[\W\d_]*h[\W\d_]*i[\W\d_]*t+[\W\d_]*?(?!\w) will match sh1i444t (digits between characters)
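For reference, a minimal C# sketch that runs the variants above against the sample inputs from the bullets. The class and method names (ArieVariantsDemo, FirstMatch) are made up for illustration; only the patterns come from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ArieVariantsDemo
{
    // Patterns copied from the bullets above.
    public const string Strict = @"\bs[\W_]*h[\W_]*i[\W_]*t[\W_]*(?!\w)";
    public const string RepeatedLast = @"\bs[\W_]*h[\W_]*i[\W_]*t+[\W_]*(?!\w)";
    public const string LazyTail = @"\bs[\W_]*h[\W_]*i[\W_]*t[\W_]*?(?!\w)";

    // Returns the text of the first match; Match.Value is the empty
    // string when the pattern does not match at all.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern, RegexOptions.IgnoreCase).Value;
    }
}
```

With these, FirstMatch("sh it&&&", ArieVariantsDemo.Strict) returns the whole input, while the LazyTail variant returns only "sh it". Because the separator class uses *, the plain word with no separators also matches; using + between the letters instead would require at least one separator, as the question asks.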

EDIT:

(?!\w) is a negative lookahead. It basically checks that your match is not followed by a word character (for ASCII text, word characters are [A-Za-z0-9_]). It has a length of 0, which means it won't be included in the match. If you want to catch words like "shi*tface" you'll have to remove it. ( http://www.regular-expressions.info/lookaround.html )

A word boundary (\b) matches a place where a word starts or ends. Its length is 0, which means that it matches between characters.

[\W] is a negated character class; it is equal to [^\w], which for ASCII text is [^a-zA-Z0-9_].
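Since the pattern in this answer is written out by hand for one word, here is a rough sketch of generating the same shape for any word. Build and SeparatedWordPattern are hypothetical names, not part of the answer. Note two deliberate differences: the sketch uses [\W_]+ (at least one separator) instead of [\W_]*, following the question's "at least one character" requirement, and it leaves out the trailing [\W_]*.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SeparatedWordPattern
{
    // Hypothetical helper: turns "shit" into \bs[\W_]+h[\W_]+i[\W_]+t(?!\w).
    public static string Build(string word)
    {
        // Escape each character in case the word contains regex metacharacters.
        var letters = word.Select(c => Regex.Escape(c.ToString()));

        // Assumption: require one or more separators between letters (+, not *).
        return @"\b" + string.Join(@"[\W_]+", letters) + @"(?!\w)";
    }
}
```

Used as Regex.IsMatch(text, SeparatedWordPattern.Build("shit"), RegexOptions.IgnoreCase), this matches S<>H<>I<>T but, by design, neither the plain word nor sh_it.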

Upvotes: 2

Wiktor Stribiżew

Reputation: 626926

You want to match words where each letter is separated with the identical non-word char(s).

You can use

\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b

See the regex demo. (I added (?!\n) to make the regex work for each line as if it were a separate string.) Details:

  • \b - word boundary
  • \p{L} - a letter
  • (?=([\W_]+)) - a positive lookahead that matches a location that is immediately followed with one or more non-word or _ chars (captured into Group 1)
  • (?:\1\p{L})+ - one or more repetitions of the same char sequence that was captured into Group 1, followed by a letter
  • \b - word boundary.
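To see what the backreference buys you, a small sketch (the names IdenticalSeparatorDemo, Find and FirstSeparator are invented here) that lists the matches and shows the separator run captured into Group 1:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class IdenticalSeparatorDemo
{
    // Pattern copied from the answer above.
    public const string Pattern = @"\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b";

    // All matched substrings, in order of appearance.
    public static List<string> Find(string text)
    {
        return Regex.Matches(text, Pattern)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }

    // The separator run captured by Group 1 for the first match,
    // or the empty string when nothing matches.
    public static string FirstSeparator(string text)
    {
        Match m = Regex.Match(text, Pattern);
        return m.Success ? m.Groups[1].Value : "";
    }
}
```

For s_/h_/i_/t the captured separator is _/ and the whole string is one match. With mixed separators such as s_h-i_t the match breaks at the point where the separator changes, so Find returns s_h and i_t as two separate matches.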

To check if there is such a pattern in a string, you can use

var HasSpamWords = Regex.IsMatch(text, @"\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b");

To return all occurrences in a string, you can use

var results = Regex.Matches(text, @"\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b")
    .Cast<Match>()
    .Select(x => x.Value)
    .ToList();

See the C# demo.

Getting the length of each string is easy if you get Match.Length and use .Select(x => x.Length). If you need to get the length of the string with all special chars removed, simply use .Select(x => x.Value.Count(c => char.IsLetter(c))) (see this C# demo).
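Put together, the two length variants described above might look like this; a sketch only, with MatchLengthDemo, FullLengths and LetterCounts as made-up names:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class MatchLengthDemo
{
    // Pattern copied from the answer above.
    private const string Pattern = @"\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b";

    // Length of each match, separators included.
    public static List<int> FullLengths(string text) =>
        Regex.Matches(text, Pattern)
            .Cast<Match>()
            .Select(m => m.Length)
            .ToList();

    // Number of letters in each match, separators ignored.
    public static List<int> LetterCounts(string text) =>
        Regex.Matches(text, Pattern)
            .Cast<Match>()
            .Select(m => m.Value.Count(c => char.IsLetter(c)))
            .ToList();
}
```

For s#h#i#t the first method gives 7 and the second gives 4.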

Upvotes: 0

Shaggydog

Reputation: 3788

Alright, HamZa's answer worked. However, I ran into a programmatic problem while working on the solution. When I was replacing just the words, I always knew the length of the word, so I knew exactly how many asterisks to replace it with. If I'm matching shit, I know I need to put 4 asterisks. But if I'm matching s[^a-z0-9]+h[^a-z0-9]+i[^a-z0-9]+t, I might catch s#h#i#t or I may catch s------h------i--------t. In both cases the length of the matched text will differ wildly from that of the pattern. How can I get the actual length of the matched string?
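One possible way to handle this, sketched under assumptions: Regex.Replace accepts a MatchEvaluator, and the Match passed to it exposes the length of the text that was actually matched. MaskDemo and its method names are hypothetical, and the pattern is assumed to have exactly one separator run between each pair of letters.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class MaskDemo
{
    // Assumed pattern: one run of separators between each pair of letters.
    private const string Pattern = @"s[^a-z0-9]+h[^a-z0-9]+i[^a-z0-9]+t";

    // Replaces each match with as many asterisks as the matched text is long.
    public static string MaskWholeMatch(string text)
    {
        return Regex.Replace(text, Pattern,
            m => new string('*', m.Length),
            RegexOptions.IgnoreCase);
    }

    // Replaces each match with one asterisk per letter in the matched text.
    public static string MaskLettersOnly(string text)
    {
        return Regex.Replace(text, Pattern,
            m => new string('*', m.Value.Count(c => char.IsLetter(c))),
            RegexOptions.IgnoreCase);
    }
}
```

MaskWholeMatch keeps the comment the same length as before, while MaskLettersOnly always produces four asterisks for this word, however long the separator runs are.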

Upvotes: 2

Aziz Shaikh

Reputation: 16524

Let's see if this regex works for you:

/\w(?:_|\W)+/
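In C# the surrounding slashes are dropped, since they are delimiters from other regex flavours. A quick sketch of what this fragment matches on its own (FragmentDemo and Find are made-up names); it also shows that (?:_|\W) behaves like the character class [\W_] used in the other answers:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class FragmentDemo
{
    // The answer's pattern without the / delimiters.
    public const string Alternation = @"\w(?:_|\W)+";

    // The same idea written as a character class.
    public const string CharClass = @"\w[\W_]+";

    // All matched substrings, in order of appearance.
    public static List<string> Find(string text, string pattern)
    {
        return Regex.Matches(text, pattern)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }
}
```

On s_/h_/i_/t both patterns yield the pieces s_/, h_/ and i_/, so this is a building block (one word character plus its trailing separators) and not a complete filter for a given word.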

Upvotes: 2
