RD Ward

Reputation: 6737

Regex, every non-alphanumeric character except white space or colon

How can I do this, ideally in any regex flavor?

Basically, I am trying to match all kinds of miscellaneous characters such as ampersands, semicolons, dollar signs, etc.

Upvotes: 225

Views: 349421

Answers (11)

MS Berends

Reputation: 5199

Previous solutions only seem reasonable for English or other Latin-based languages without accents, etc. Those answers are for that reason not generalised to answer the question.

According to the Whitespace character article on Wikipedia, these are all the whitespace characters in Unicode:

U+0009, U+000A, U+000B, U+000C, U+000D, U+0020, U+0085, U+00A0, U+1680, U+180E, U+2000, U+2001, U+2002, U+2003, U+2004, U+2005, U+2006, U+2007, U+2008, U+2009, U+200A, U+200B, U+200C, U+200D, U+2028, U+2029, U+202F, U+205F, U+2060, U+3000, U+FEFF

So in my opinion, the most inclusive solution would be (might be slow, but this is about accuracy):

\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u200B\u200C\u200D\u2028\u2029\u202F\u205F\u2060\u3000\uFEFF

Thus, to answer the OP's question ("every non-alphanumeric character except white space or colon"), surround the list with [ and ] to mean 'any of these characters', prepend a caret ^ to negate the set, and add the colon to it. As written, the class excludes only whitespace and the colon; add letter and digit ranges to the set if those should not match either:

"[^:\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u200B\u200C\u200D\u2028\u2029\u202F\u205F\u2060\u3000\uFEFF]"

Debuggex Demo
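
A minimal JavaScript sketch of the class above, extended so that letters and digits are excluded as well. It assumes an engine with Unicode property escapes (ES2018+, `u` flag); the names `UNICODE_WS` and `miscChars` are hypothetical, and contiguous code points from the list are collapsed into ranges.

```javascript
// Hypothetical names; assumes ES2018+ (\p{...} needs the `u` flag).
// Same code points as the whitespace list above, with contiguous runs
// written as ranges (U+0009-U+000D and U+2000-U+200D).
const UNICODE_WS =
  "\\u0009-\\u000D\\u0020\\u0085\\u00A0\\u1680\\u180E\\u2000-\\u200D" +
  "\\u2028\\u2029\\u202F\\u205F\\u2060\\u3000\\uFEFF";

// Negated set: not a colon, not a letter, not a number, not listed whitespace.
const miscChars = new RegExp(`[^:\\p{L}\\p{N}${UNICODE_WS}]`, "gu");

// Sample uses escapes for the accented letters, a no-break space (U+00A0)
// and an ideographic space (U+3000).
const sample = "\u00D1and\u00FA & co: 50$\u00A0;\u3000ok";
const found = sample.match(miscChars);
console.log(found); // [ '&', '$', ';' ]
```

Without `\p{L}\p{N}` in the set, the same sample would also report every letter and digit as a match.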


Bonus: solution for R

trimws2 <- function(..., whitespace = "[\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u200B\u200C\u200D\u2028\u2029\u202F\u205F\u2060\u3000\uFEFF]") {
  trimws(..., whitespace = whitespace)
}

In the benchmark below this is even faster than trimws() itself, whose default is whitespace = "[ \t\r\n]".

microbenchmark::microbenchmark(trimws2(" \t\r\n"), trimws(" \t\r\n"))
#> Unit: microseconds
#>                   expr    min     lq     mean  median      uq     max neval cld
#>  trimws2(" \\t\\r\\n") 29.177 29.875 31.94345 30.4990 31.3895 105.642   100  a 
#>   trimws(" \\t\\r\\n") 45.811 46.630 48.25076 47.2545 48.2765 116.571   100   b

Upvotes: 1

its_ me

Reputation: 21

[^\w\s:]

A character set matching characters that are not:

  • Alphanumeric
  • Whitespace
  • Colon

Upvotes: -3

Chris Halcrow

Reputation: 31950

In JavaScript:

/[\W_]/g

\W any non-word character (i.e. anything that is not an alphanumeric character or an underscore)

_ additionally match the underscore, as \w considers it a 'word' character and \W alone would skip it

Note that [^\w_] would not do this: inside a negated set the extra _ is redundant, so underscores are never matched.

Usage example - const nonAlphaNumericChars = /[\W_]/g;
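
Because \w already includes the underscore, a class that should also match the underscore can be built on \W. A small sketch (the sample string and variable names are illustrative only):

```javascript
// \W = anything that is not [A-Za-z0-9_]; adding "_" makes the class
// match every character that is not an ASCII letter or digit.
const nonAlphaNumeric = /[\W_]/g;

const cleaned = "snake_case & kebab-case!".replace(nonAlphaNumeric, "");
console.log(cleaned); // "snakecasekebabcase"
```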

Upvotes: 8

Ste

Reputation: 2293

This regex works for C#, PCRE and Go to name a few.

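According to RegexBuddy it doesn't work for JavaScript on Chrome as written; JavaScript only accepts \p{L} when the regex has the u flag (ES2018 and later). There's already a JavaScript example in another answer here.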

The main part of this is:

\p{L}

\p{L} (also written \p{Letter}) represents any kind of letter from any language.


The full regex itself: [^\w\d\s:\p{L}]

Example: https://regex101.com/r/K59PrA/2
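
For JavaScript, the same class can be tried with the `u` flag, which enables `\p{L}` in engines supporting ES2018 or later. This is only a sketch; the sample string is made up for illustration.

```javascript
// The class from this answer, with the `u` flag so \p{L} is understood.
const miscChars = /[^\w\d\s:\p{L}]/gu;

// "café: 10€ + naïve?" written with escapes for the non-ASCII characters.
const sample = "caf\u00E9: 10\u20AC + na\u00EFve?";
const found = sample.match(miscChars);
console.log(found); // [ '€', '+', '?' ]
```

The accented letters are covered by `\p{L}`, so only the currency sign and punctuation are reported.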

Upvotes: 2

Kim-Trinh

Reputation: 79

If you mean "non-alphanumeric characters", try to use this:

var reg = /[^a-zA-Z0-9]/g;      // negated set, same form as [^abc]

Upvotes: 5

Er Parthu

Reputation: 20

Try to add this:

^[^a-zA-Z\d\s:]*$
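
Note that the ^ and $ anchors make this pattern test a whole string rather than find individual characters. A quick sketch of that behavior (variable names are illustrative):

```javascript
// Anchored version: true only if EVERY character in the string is
// something other than an ASCII letter, digit, whitespace or colon.
const onlyMisc = /^[^a-zA-Z\d\s:]*$/;

const allMisc = onlyMisc.test("&;$");   // true
const hasLetter = onlyMisc.test("a&;$"); // false, "a" is a letter
const empty = onlyMisc.test("");         // true, * also allows zero characters
console.log(allMisc, hasLetter, empty);
```

Use + instead of * if an empty string should not pass.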

This has worked for me... :)

Upvotes: -3

Topera

Reputation: 12389

Try this:

[^a-zA-Z0-9 :]

JavaScript example:

"!@#$%* ABC def:123".replace(/[^a-zA-Z0-9 :]/g, ".")

See an online example:

http://jsfiddle.net/vhMy8/

Upvotes: 16

Luke Sneeringer

Reputation: 9428

This should do it:

[^a-zA-Z\d\s:]

Upvotes: 47

Nick F

Reputation: 10112

If you want to treat accented Latin characters (e.g. à Ñ) as normal letters (i.e. avoid matching them too), you'll also need to include the appropriate Unicode range (\u00C0-\u00FF) in your regex, so it would look like this:

/[^a-zA-Z\d\s:\u00C0-\u00FF]/g
  • ^ negates what follows
  • a-zA-Z matches upper and lower case letters
  • \d matches digits
  • \s matches white space (if you only want to match spaces, replace this with a space)
  • : matches a colon
  • \u00C0-\u00FF matches the Unicode range for accented Latin characters.

nb. Unicode range matching might not work for all regex engines, but the above certainly works in JavaScript (as seen in this pen on Codepen).

nb2. If you're not bothered about matching underscores, you could replace a-zA-Z\d with \w, which matches letters, digits, and underscores.
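
A short sketch comparing the pattern with and without the extra range; the sample string and variable names are only for illustration.

```javascript
const asciiOnly = /[^a-zA-Z\d\s:]/g;
const withLatin1 = /[^a-zA-Z\d\s:\u00C0-\u00FF]/g;

// "à la carte: Ñandú #1" written with escapes for the accented letters.
const sample = "\u00E0 la carte: \u00D1and\u00FA #1";

const withoutRange = sample.match(asciiOnly);
const withRange = sample.match(withLatin1);
console.log(withoutRange); // [ 'à', 'Ñ', 'ú', '#' ]
console.log(withRange);    // [ '#' ]
```

The range \u00C0-\u00FF also contains × (U+00D7) and ÷ (U+00F7), which are symbols rather than letters, so those two are left unmatched as well.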

Upvotes: 29

Vasyl Gutnyk

Reputation: 5039

Matches anything that is not alphanumeric or white space, plus '_'.

var reg = /[^\w\s]|_/g;

Upvotes: 4

Tudor Constantin

Reputation: 26861

[^a-zA-Z\d\s:]
  • \d - numeric class
  • \s - whitespace
  • a-zA-Z - matches all the ASCII letters
  • ^ - negates them all - so you get non-alphanumeric chars that are also not whitespace and not colons
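
A minimal JavaScript sketch of using this class to list or strip the miscellaneous characters (the sample string and names are made up for illustration):

```javascript
// The `g` flag makes match() return every hit and replace() act on all of them.
const miscChars = /[^a-zA-Z\d\s:]/g;
const sample = "Total: $5 & tax; 10%";

const found = sample.match(miscChars);
const stripped = sample.replace(miscChars, "");
console.log(found);    // [ '$', '&', ';', '%' ]
console.log(stripped); // "Total: 5  tax 10" (the two spaces around "&" remain)
```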

Upvotes: 409
