Reputation: 552
Yes, there are lots of hashtag regexes available here, but none of them suits my needs, and none actually solves the problem.
The regex should consider the following hashtags as valid:
#validhashtag
#valid_hashtag
#validhashtag_with_space_before_or_after
#valid_hashtag_chars_öÖäÄüÜß
...and these should not be valid:
ipsum#notvalid //Not valid: Connected to Word
http://google.com/#results //Not valid: Same as above
#not-valid
#not!valid
Allowed characters should be: a-z, A-Z, 0-9, öÖäÄüÜß, _
Max length should be 50 characters.
The main problem is the case where the hashtag is "connected" to another piece of text. I don't know how to solve that.
This is what I attempted:
/([\p{Pc}\p{N}\p{L}\p{Mn}]{1,50})/u
That one works pretty well but doesn't handle the "word#hashtag" problem.
Upvotes: 0
Views: 470
Reputation: 627488
You may use either of the two below:
/(?<!\S)#\w+(?!\S)/u
/(?<!\S)#[\w\p{M}\p{Pc}]+(?!\S)/u
If you want to restrict the length of the word part, keep your {1,50} quantifier: /(?<!\S)#\w{1,50}(?!\S)/u
Also note: even with the u modifier, \w does not match the same chars that are considered "word" chars in .NET, Java, or Python re regex. You may decide to include other classes to fill the gap and use [\w\p{M}\p{Pc}]+ instead of just \w, where \p{M} matches any diacritics and \p{Pc} matches any connector punctuation.
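To illustrate the note above, here is a minimal sketch. The sample tag is my own assumption: it spells the accent as a combining mark (U+0301, category Mn) rather than a precomposed letter. Whether plain \w accepts that mark depends on the PCRE version PHP is built against, so no result is claimed for the first pattern; the extended class names \p{M} explicitly and should match.

```php
<?php
// Hypothetical sample: "e" followed by U+0301 COMBINING ACUTE ACCENT (category Mn).
$tag = "#cafe\u{0301}";

$plain    = '~(?<!\S)#\w+(?!\S)~u';
$extended = '~(?<!\S)#[\w\p{M}\p{Pc}]+(?!\S)~u';

// Result depends on how the underlying PCRE library defines \w in Unicode mode.
var_dump(preg_match($plain, $tag));

// \p{M} covers the combining mark explicitly, so the whole tag is matched here.
$found = preg_match($extended, $tag, $m);
var_dump($found, $m[0] === $tag);
```

If your input is normalized to precomposed characters (such as the öÖäÄüÜß in the question), both patterns should behave the same for those letters.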
Details
(?<!\S) - a whitespace or start of string required right before
# - a # sign
\w+ - 1+ word chars (NOTE: if you want to restrict the length from 1 to 50, replace + with {1,50}; also note that the u modifier lets the PCRE engine match any Unicode letters and digits with the \w shorthand)
[\w\p{M}\p{Pc}]+ - 1+ word chars, diacritics (\p{M}), and connector punctuation (\p{Pc}, considered word chars in .NET regex)
(?!\S) - a whitespace or end of string required right after.

$s = "#validhashtag
#valid_hashtag
#validhashtag_with_space_before_or_after
#valid_hashtag_chars_öÖäÄüÜß
...and not valid shoulw be:
ipsum#notvalid //Not valid: Connected to Word
http://google.com/#results //Not valid: Same as above
#not-valid
#not!valid";
if (preg_match_all('~(?<!\S)#\w+(?!\S)~u', $s, $matches)) {
print_r($matches[0]);
}
Output:
Array
(
[0] => #validhashtag
[1] => #valid_hashtag
[2] => #validhashtag_with_space_before_or_after
[3] => #valid_hashtag_chars_öÖäÄüÜß
)
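The pieces above (whitespace boundaries plus the {1,50} limit) can be combined into a small helper. This is only a sketch: the function name extract_hashtags, its $maxLen parameter, and the sample text are assumptions of mine, not part of the answer.

```php
<?php
// Hypothetical helper: returns the hashtags found in $text, without the leading "#".
function extract_hashtags(string $text, int $maxLen = 50): array
{
    // (?<!\S) and (?!\S) require whitespace or a string edge on both sides,
    // so "word#tag", "#not-valid" and over-long tags are all rejected.
    $pattern = '~(?<!\S)#(\w{1,' . $maxLen . '})(?!\S)~u';
    preg_match_all($pattern, $text, $matches);
    return $matches[1]; // capture group 1: the tag text after "#"
}

// Sample input: one connected tag, two valid tags, one with "-", one 51 chars long.
$text = "ipsum#notvalid #ok_1 #not-valid #grüße #" . str_repeat('a', 51);
$tags = extract_hashtags($text);
print_r($tags);
```

With this input, only ok_1 and grüße are returned. The 51-character tag is rejected as a whole rather than truncated, because the lookahead (?!\S) fails as long as another non-whitespace character follows.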
Upvotes: 3
Reputation: 27743
I think your original expression is pretty good; we'd just modify it to:
^\s*#([\p{Pc}\p{N}\p{L}\p{Mn}]{1,50})$
$re = '/^\s*#([\p{Pc}\p{N}\p{L}\p{Mn}]{1,50})$/um';
$str = '#validhashtag
#valid_hashtag
 #validhashtag_with_space_before_or_after
#valid_hashtag_chars_öÖäÄüÜß
ipsum#notvalid //Not valid: Connected to Word
http://google.com/#results //Not valid: Same as above
#not-valid
#not!valid';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
// Print the entire match result
var_dump($matches);
Output:
array(4) {
[0]=>
array(2) {
[0]=>
string(13) "#validhashtag"
[1]=>
string(12) "validhashtag"
}
[1]=>
array(2) {
[0]=>
string(14) "#valid_hashtag"
[1]=>
string(13) "valid_hashtag"
}
[2]=>
array(2) {
[0]=>
string(41) " #validhashtag_with_space_before_or_after"
[1]=>
string(39) "validhashtag_with_space_before_or_after"
}
[3]=>
array(2) {
[0]=>
string(35) "#valid_hashtag_chars_öÖäÄüÜß"
[1]=>
string(34) "valid_hashtag_chars_öÖäÄüÜß"
}
}
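One thing worth checking before adopting this pattern: as far as I can tell, the ^ and $ anchors (with the m flag) mean it only accepts a hashtag that sits alone on its line, optionally preceded by whitespace. A minimal sketch with two inputs of my own:

```php
<?php
$re = '/^\s*#([\p{Pc}\p{N}\p{L}\p{Mn}]{1,50})$/um';

// Hypothetical inputs: a tag alone on its line versus a tag inside a sentence.
$alone  = "  #standalone";
$inline = "hello #inline world";

$aloneResult  = preg_match($re, $alone);   // the tag is the only thing on the line
$inlineResult = preg_match($re, $inline);  // text before and after the tag
var_dump($aloneResult, $inlineResult);
```

The first call reports a match and the second does not, so this variant fits line-per-tag input, while the lookaround-based patterns in the other answer are the better fit for tags embedded in running text.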
Upvotes: 4