AlexD

Reputation: 4310

Regex for matching HashTags in any language

I have a field in my application where users can enter a hashtag. I want to validate their entry and make sure they enter what would be a proper hashtag. It can be in any language and it should NOT be preceded by the # sign. I am writing in JavaScript.

So the following are GOOD examples:

And the following are BAD examples:

We had a regex that matched only a-zA-Z0-9. We needed to add language support, so we changed it to exclude only white space, but forgot to exclude special characters, so here I am.

Some other StackOverflow examples I saw but didn't work for me:

  1. Other languages don't work
  2. Again, English only


Upvotes: 7

Views: 3709

Answers (4)

lambneck

Reputation: 19

/#[\p{L}\p{N}_]+/gu

This works for me, and addresses many of the concerns mentioned in comments.
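As a rough sketch of how that pattern could be used to pull hashtags out of text: the helper name `extractHashtags` is made up for illustration, and the code assumes an engine with Unicode property escapes (ES2018 or later).

```javascript
// Hypothetical helper: collect every hashtag found in a piece of text.
function extractHashtags(text) {
  // \p{L} = any Unicode letter, \p{N} = any Unicode number.
  // The u flag is required for \p{...}; g finds all matches.
  const pattern = /#[\p{L}\p{N}_]+/gu;
  return Array.from(text.matchAll(pattern), (match) => match[0]);
}

console.log(extractHashtags("a #привет b #東京2020, c # d"));
// → [ '#привет', '#東京2020' ]
```

A lone `#` is skipped because the character class must match at least once.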

Upvotes: 0

Mahbubur Rahman Khan

Reputation: 415

First, excluding every symbol is not a practical solution, because the set of symbols depends on the keyboard layout and there are hundreds of math symbols and so on. So use this instead:

[\p{sc=Bengali}\p{L}_\p{N}]+

1. If a language needs extra care, add its script to the character class, for example \p{sc=Bengali}. Note that \p{sc=...} takes a script name, not a language name: Spanish is written in the Latin script and is already covered by \p{L}. Bangla has dependent vowel signs such as া, ে, ৌ. These are combining marks rather than letters, so \p{L} alone does not match them and the Bengali script has to be recognized separately with \p{sc=Bengali}.

2. Then use \p{L}, which matches any single code point in the Unicode category "letter": a-z, letters like é, ü, ğ, i, ç, and the plain letters of any other alphabet.

3. _ underscore allowed

4. \p{N} matches any kind of numeric character in any script. (\d matches only an ASCII digit, equal to [0-9], so \p{N} is the only option for allowing Unicode digits, because it works with any digit code point.)
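A small sketch of why the script property matters for Bangla, assuming an engine with Unicode property escapes (the u flag). The variable names are illustrative. The `|` is left out of the class here, because inside `[...]` it would match a literal pipe character. The `\p{M}` variant is an alternative not taken from this answer: it allows combining marks from every script.

```javascript
// "বাংলা" written with escapes: letters plus dependent vowel signs (marks).
const bangla = "\u09AC\u09BE\u0982\u09B2\u09BE";

// Letters, numbers and underscore only: the combining marks are rejected.
const lettersOnly = /^[\p{L}\p{N}_]+$/u;

// Adding the Bengali script accepts every Bengali code point, marks included.
const withScript = /^[\p{sc=Bengali}\p{L}\p{N}_]+$/u;

// Alternative (an assumption, not from the answer): allow marks in any script.
const withMarks = /^[\p{L}\p{M}\p{N}_]+$/u;

console.log(lettersOnly.test(bangla)); // false
console.log(withScript.test(bangla));  // true
console.log(withMarks.test(bangla));   // true
console.log(withScript.test("tag|x")); // false, the pipe is not allowed
```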

Upvotes: 0

Aerodynamika

Reputation: 8423

I don't understand why this question does not get more votes. Hashtag detection for multiple languages is a problem. The only working option I could find is posted by Lucas above (all other ones do not work so well).

It needs a modification though:

#[^\s!@#$%^&*()=+.\/,\[{\]};:'"?><]+

DEMO

This detects all the hashtags, not only one at the beginning of the string, fixes an unescaped character, and removes the unnecessary $ at the end.
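A minimal sketch of using this pattern to find every hashtag in a string. The function name `findHashtags` is hypothetical, and the g flag is added here so that all matches are returned.

```javascript
// Blacklist-based pattern from this answer, with the g flag added.
const HASHTAG = /#[^\s!@#$%^&*()=+.\/,\[{\]};:'"?><]+/g;

// Hypothetical helper: return all hashtags, or an empty array if none.
function findHashtags(text) {
  return text.match(HASHTAG) || [];
}

console.log(findHashtags("hi #hello, #мир! and #日本語"));
// → [ '#hello', '#мир', '#日本語' ]
```

Any character that is not on the blacklist is accepted, so a tag like `#a-b` is matched as a whole.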

Upvotes: 4

Lucas Trzesniewski

Reputation: 51430

If your disallowed characters list is thorough (!@#$%^&*()=+./,[{]};:'"?><), then the regex is:

^#?[^\s!@#$%^&*()=+./,\[{\]};:'"?><]+$

Demo

This allows an optional leading # sign: #?. It disallows the special characters using a negative character class. I just added \s to the list (spaces), and also I escaped [ and ].
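For illustration, a validation function built on this regex might look like the sketch below. `isValidHashtag` is a made-up name, and the `/` inside the class is escaped here to be safe in a regex literal.

```javascript
// Anchored blacklist regex: optional leading #, then one or more
// characters that are neither whitespace nor a listed special character.
const VALID_HASHTAG = /^#?[^\s!@#$%^&*()=+.\/,\[{\]};:'"?><]+$/;

// Hypothetical helper: true when the whole input is an acceptable hashtag.
function isValidHashtag(input) {
  return VALID_HASHTAG.test(input);
}

console.log(isValidHashtag("hello"));       // true
console.log(isValidHashtag("#日本語"));      // true
console.log(isValidHashtag("hello world")); // false (contains a space)
console.log(isValidHashtag("hello!"));      // false (contains !)
console.log(isValidHashtag(""));            // false (empty)
```

If the leading # must be rejected, as in the question, dropping `#?` from the pattern should be enough, since # is already in the disallowed list.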

Unfortunately, you can't use constructs like \p{P} (Unicode punctuation) in JavaScript's regexes unless the engine supports Unicode property escapes (ES2018, with the u flag). Without that support, you basically have to blacklist characters or take a different approach if the regex solution isn't good enough for your needs.

Upvotes: 5
