Wex

Reputation: 4686

Regular Expression to match #hashtag but not #hashtag; (with semicolon)

I have the current regular expression:

/(?<=[\s>]|^)#(\w*[A-Za-z_]+\w*)/g

Which I'm testing against the string:

Here's a #hashtag and here is #not_a_tag; which should be different. Also testing: Mid#hash. #123 #!@£ and <p>#hash</p>

For my purposes, only two hashtags should be detected in this string. How can I alter the expression so that it doesn't match hashtags that end with a semicolon (;)? In my example, this is #not_a_tag;
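To make the problem concrete, here is a minimal sketch in JavaScript that runs the expression above against the test string. It assumes an engine with lookbehind support (for example a recent Node.js or Chrome); the variable names are only illustrative.

```javascript
// The test string from the question.
const questionInput =
  "Here's a #hashtag and here is #not_a_tag; which should be different. " +
  "Also testing: Mid#hash. #123 #!@£ and <p>#hash</p>";

// The current expression, unchanged.
const currentPattern = /(?<=[\s>]|^)#(\w*[A-Za-z_]+\w*)/g;

// With the g flag, String.prototype.match returns every full match.
const currentMatches = questionInput.match(currentPattern);
console.log(currentMatches);
// [ '#hashtag', '#not_a_tag', '#hash' ]
```

The second entry, #not_a_tag, is the unwanted match.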

Cheers.

Upvotes: 18

Views: 52872

Answers (8)

Jesse Dirisu

Reputation: 191

Try this regular expression: /#[\w\d(_)]+\b/g

Upvotes: 0

SVG-Heart

Reputation: 180

(?<=(\s|^))#[^\s\!\@\#\$\%\^\&\*\(\)]+(?=(\s|$))

A regex that matches any hashtag.

With this approach, any character is accepted in a hashtag except whitespace and the common symbols !@#$%^&*()

Usage Notes

Turn on the "g" and "m" flags when using it.

It was tested for Java and JavaScript via https://regex101.com and VS Code.

It is available on this repo.

Upvotes: 1

Ajay Lingayat

Reputation: 1673

You could try this pattern: /#\S+/

It matches # followed by every character up to the next whitespace, so trailing punctuation such as ; is included in the match.

Upvotes: 1

nhCoder

Reputation: 588

This is the best practice.

(#+[a-zA-Z0-9(_)]{1,})

Upvotes: 12

ne4istb

Reputation: 672

/(#(?:[^\x00-\x7F]|\w)+)/g

Starts with #, followed by at least one (+) non-ASCII character ([^\x00-\x7F], a negated class that excludes the ASCII range) or word character (\w).

This one should cover cases that include non-ASCII characters, such as "#їжак".
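A quick check of this pattern, as a sketch in JavaScript. The sample sentence and variable names are made up for illustration.

```javascript
// Pattern from this answer: # followed by non-ASCII or word characters.
const unicodePattern = /(#(?:[^\x00-\x7F]|\w)+)/g;

// Hypothetical sample mixing Cyrillic and ASCII hashtags.
const unicodeSample = "Mixed tags: #їжак #hedgehog #їжак_2";

const unicodeMatches = unicodeSample.match(unicodePattern);
console.log(unicodeMatches);
// [ '#їжак', '#hedgehog', '#їжак_2' ]
```

Note that the negated class accepts any non-ASCII character, including symbols such as £, not only letters. If only letters and digits should be allowed, a Unicode property escape such as \p{L} with the u flag may be a stricter alternative.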

Upvotes: 9

tk78

Reputation: 957

How about the following:

\B(\#[a-zA-Z]+\b)(?!;)

Regex Demo

  • \B -> Not a word boundary
  • (#[a-zA-Z]+\b) -> Capturing group beginning with # followed by one or more letters (a-z or A-Z), with a word boundary at the end
  • (?!;) -> Not followed by ;
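The breakdown above can be tried out with a short JavaScript sketch against the string from the question. The g flag is added here so that all matches are returned, and the variable names are illustrative.

```javascript
// The test string from the question.
const letterInput =
  "Here's a #hashtag and here is #not_a_tag; which should be different. " +
  "Also testing: Mid#hash. #123 #!@£ and <p>#hash</p>";

// Pattern from this answer (the backslash before # is optional in JavaScript).
const letterOnlyPattern = /\B(#[a-zA-Z]+\b)(?!;)/g;

const letterMatches = letterInput.match(letterOnlyPattern);
console.log(letterMatches);
// [ '#hashtag', '#hash' ]

// Because only letters are allowed before the word boundary,
// tags containing digits or underscores are not matched at all.
const mixedMatches = "#web2 #snake_case".match(letterOnlyPattern);
console.log(mixedMatches);
// null
```

Whether rejecting tags such as #web2 is acceptable depends on what should count as a hashtag.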

Upvotes: 39

garyh

Reputation: 2852

Similar to anubhava's answer, but swap the two instances of \w* for \d*, since the only difference between \w and [A-Za-z_] is the digits 0-9.

This has the effect of reducing the number of steps from 588 to 90.

(?<=[\s>])#(\d*[A-Za-z_]+\d*)\b(?!;)

Regex101 demo

Upvotes: 1

anubhava

Reputation: 785196

You can use a negative lookahead regex:

/(?<=[\s>]|^)#(\w*[A-Za-z_]+\w*)\b(?!;)/
  • \b - the word boundary ensures that we are at the end of a word
  • (?!;) - asserts that the next character is not a semicolon

RegEx Demo
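As a sketch of how this behaves in JavaScript, assuming an engine with lookbehind support: the g flag is added here so that matchAll can be used, and the variable names are illustrative. The second pattern drops \b to show why the word boundary matters.

```javascript
// The test string from the question.
const lookaheadInput =
  "Here's a #hashtag and here is #not_a_tag; which should be different. " +
  "Also testing: Mid#hash. #123 #!@£ and <p>#hash</p>";

// Pattern from this answer, with the g flag added for matchAll.
const withBoundary = /(?<=[\s>]|^)#(\w*[A-Za-z_]+\w*)\b(?!;)/g;

// Collect capture group 1 (the tag name without the leading #).
const tagNames = Array.from(lookaheadInput.matchAll(withBoundary), (m) => m[1]);
console.log(tagNames);
// [ 'hashtag', 'hash' ]

// Without \b the engine backtracks one character, so the lookahead
// sees "g" instead of ";" and a truncated tag is matched.
const withoutBoundary = /(?<=[\s>]|^)#(\w*[A-Za-z_]+\w*)(?!;)/g;
const truncated = lookaheadInput.match(withoutBoundary);
console.log(truncated);
// [ '#hashtag', '#not_a_ta', '#hash' ]
```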

Upvotes: 4
