Mike

Reputation: 552

Finding hashtags in Text

Yes, there are lots of hashtag regexes available here, but none of them suits my needs, and none of the answers I found actually solves the problem.

The Regex should consider the following hashtags as valid:

#validhashtag
#valid_hashtag
 #validhashtag_with_space_before_or_after
#valid_hashtag_chars_öÖäÄüÜß

...and these should not be valid:

ipsum#notvalid //Not valid: Connected to Word
http://google.com/#results //Not valid: Same as above
#not-valid
#not!valid

Allowed characters should be: a-z, A-Z, 0-9, öÖäÄüÜß, _

Max length should be 50 characters.

The main problem is the case where the hashtag is "connected" to another piece of text. I don't know how to solve that.

This is what I attempted:

/([\p{Pc}\p{N}\p{L}\p{Mn}]{1,50})/u

That one works pretty well but doesn't handle the "word#hashtag" problem.

Upvotes: 0

Views: 470

Answers (2)

Wiktor Stribiżew

Reputation: 627488

You may use either of the two below:

/(?<!\S)#\w+(?!\S)/u
/(?<!\S)#[\w\p{M}\p{Pc}]+(?!\S)/u

See the regex demo. If you want to restrict the length of the word part, keep your {1,50} quantifier: /(?<!\S)#\w{1,50}(?!\S)/u.

Also note: \w, even with the u modifier, does not match the same chars that are considered "word" chars in .NET, Java, or Python's re. You may decide to include other classes to fill the gap and use [\w\p{M}\p{Pc}]+ instead of just \w+, where \p{M} matches any diacritics and \p{Pc} matches any connector punctuation.
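To illustrate the extended class, here is a minimal sketch that assumes a hashtag written with a decomposed "é" (a plain "e" followed by U+0301, a combining mark). The sample tag is made up for this example. Whether plain \w also accepts the mark may depend on the PCRE version bundled with your PHP build, so only the extended pattern is checked here.

```php
<?php
// Hypothetical input: "e" + U+0301 COMBINING ACUTE ACCENT (category Mn),
// i.e. a decomposed "é" at the end of the tag.
$tag = "#cafe\u{0301}";

// Word chars plus all marks (\p{M}) and connector punctuation (\p{Pc}).
$extended = '~(?<!\S)#[\w\p{M}\p{Pc}]+(?!\S)~u';

$found = preg_match($extended, $tag, $m);

var_dump($found);          // int(1)
var_dump($m[0] === $tag);  // bool(true): the combining mark is part of the match
```

If your input is always NFC-normalized (precomposed letters such as "é" as a single code point), the plain \w+ version is likely enough.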

Details

  • (?<!\S) - a whitespace or start of string required right before
  • # - a # sign
  • \w+ - 1+ word chars (NOTE: if you want to restrict the length to between 1 and 50, replace + with {1,50}; also note that the u modifier lets the PCRE engine match any Unicode letters and digits with the \w shorthand)
  • [\w\p{M}\p{Pc}]+ - 1+ chars that are word chars, diacritics (\p{M}), or connector punctuation (\p{Pc}, considered a word char in .NET regex)
  • (?!\S) - a whitespace or end of string required right after.
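The pieces above can be put together with the length limit from the question. This is only a sketch: findHashtags is a hypothetical helper name, and the sample sentence is invented to exercise the "connected to a word" and "too long" cases.

```php
<?php
// Hypothetical helper: returns all standalone hashtags of 1 to 50 word chars.
function findHashtags(string $text): array
{
    // (?<!\S)  : preceded by whitespace or start of string
    // \w{1,50} : 1 to 50 word chars (Unicode-aware because of the u flag)
    // (?!\S)   : followed by whitespace or end of string
    preg_match_all('~(?<!\S)#\w{1,50}(?!\S)~u', $text, $matches);
    return $matches[0];
}

$tooLong = '#' . str_repeat('a', 51); // 51 chars after the #, one over the limit

$tags = findHashtags("see #ok_tag and #äöü_2 but not word#tag or $tooLong");
print_r($tags); // #ok_tag and #äöü_2 only
```

The over-long tag is rejected as a whole rather than truncated, because after 50 chars the (?!\S) lookahead still sees a non-whitespace char and no shorter match can satisfy it either.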

PHP demo:

$s = "#validhashtag
#valid_hashtag
 #validhashtag_with_space_before_or_after
#valid_hashtag_chars_öÖäÄüÜß
...and not valid shoulw be:

ipsum#notvalid //Not valid: Connected to Word
http://google.com/#results //Not valid: Same as above
#not-valid
#not!valid";
if (preg_match_all('~(?<!\S)#\w+(?!\S)~u', $s, $matches)) {
   print_r($matches[0]);
}

Output:

Array
(
    [0] => #validhashtag
    [1] => #valid_hashtag
    [2] => #validhashtag_with_space_before_or_after
    [3] => #valid_hashtag_chars_öÖäÄüÜß
)

Upvotes: 3

Emma

Reputation: 27743

I think your original expression is pretty good; we'd just anchor it to the line and add the # (used with the u and m flags, as in the test below):

^\s*#([\p{Pc}\p{N}\p{L}\p{Mn}]{1,50})$

Demo

Test

$re = '/^\s*#([\p{Pc}\p{N}\p{L}\p{Mn}]{1,50})$/um';
$str = '#validhashtag
#valid_hashtag
 #validhashtag_with_space_before_or_after
#valid_hashtag_chars_öÖäÄüÜß

ipsum#notvalid //Not valid: Connected to Word
http://google.com/#results //Not valid: Same as above
#not-valid
#not!valid';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

// Print the entire match result
var_dump($matches);

Output

array(4) {
  [0]=>
  array(2) {
    [0]=>
    string(13) "#validhashtag"
    [1]=>
    string(12) "validhashtag"
  }
  [1]=>
  array(2) {
    [0]=>
    string(14) "#valid_hashtag"
    [1]=>
    string(13) "valid_hashtag"
  }
  [2]=>
  array(2) {
    [0]=>
    string(41) " #validhashtag_with_space_before_or_after"
    [1]=>
    string(39) "validhashtag_with_space_before_or_after"
  }
  [3]=>
  array(2) {
    [0]=>
    string(35) "#valid_hashtag_chars_öÖäÄüÜß"
    [1]=>
    string(34) "valid_hashtag_chars_öÖäÄüÜß"
  }
}
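One thing to check before relying on the anchored form: because of ^ and $, it only accepts a hashtag that sits alone on its line. A small sketch, assuming an input (made up here) where the hashtag appears in the middle of a sentence:

```php
<?php
$anchored = '/^\s*#([\p{Pc}\p{N}\p{L}\p{Mn}]{1,50})$/um';

// Hypothetical inputs, not from the question.
$inline = 'Lorem #ipsum dolor'; // hashtag surrounded by other words
$alone  = '  #ipsum';           // only whitespace before, nothing after

$inlineResult = preg_match($anchored, $inline);
$aloneResult  = preg_match($anchored, $alone);

var_dump($inlineResult); // int(0): the line does not start with the hashtag
var_dump($aloneResult);  // int(1)
```

That fits the sample data in the question, where every valid hashtag is on its own line. If hashtags can appear inline, a whitespace-boundary check would be needed instead of the line anchors.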

Upvotes: 4
