Reputation: 308
I'm having trouble extracting only UTF-8 letters such as ä, ö, ü, ß (that is, letters that are used in words), without characters like !"§$%&/()+' etc.
function getHashtags($string)
{
    $string = html_entity_decode($string, ENT_QUOTES, "utf-8");
    preg_match_all('/(\#)([^\s]+)/u', $string, $matches);
    if ($matches) {
        $hashtagsArray = array_count_values($matches[0]);
        $hashtags = array_keys($hashtagsArray);
        $hashtagLine = '';
        foreach ($hashtags as $hashs) {
            $hashs = strtolower(trim($hashs));
            $hashtagLine .= $hashs;
        }
    }
    return $hashtagLine;
}
This is my current solution: it receives a text string, extracts the hashtags from it, and returns them on one line. The problem is that this solution also picks up trailing punctuation, so a hashtag like #example!"$/% is returned whole instead of being cut just before the ! to give #example.
Does anyone have a (regex) approach for cleanly extracting Twitter-like UTF-8 hashtags from a string in PHP, without those unwanted punctuation characters?
Upvotes: 0
Views: 2103
Reputation: 335
You can use the regex below:
$regex = "(?:#)([\p{L}\p{N}_](?:(?:[\p{L}\p{N}_]|(?:\.(?!\.))){0,28}(?:[\p{L}\p{N}_]))?)";
It works similarly to Facebook and Instagram hashtags.
Gist on GitHub: https://gist.github.com/khanzadimahdi/2ecfe1ba38860db132b4543ab5126926
You can test it using the links below:
https://regex101.com/r/4SAxik/1
https://www.regexpal.com/?fam=113956
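As a rough sketch, this is how the pattern might be applied with preg_match_all. The ~ delimiters and the u modifier are my additions (assumptions, not part of the pattern as posted): PHP's preg functions need delimiters around the pattern, and u makes \p{L} and \p{N} work on UTF-8 input. The sample text is made up for illustration.

```php
<?php
// Assumption: the pattern from above, wrapped in ~ delimiters with the
// u (UTF-8) modifier added so that \p{L} and \p{N} match Unicode characters.
$regex = '~(?:#)([\p{L}\p{N}_](?:(?:[\p{L}\p{N}_]|(?:\.(?!\.))){0,28}(?:[\p{L}\p{N}_]))?)~u';

// Hypothetical sample input, including the problem case from the question.
$text = 'Grüße aus #München! Siehe #php_8 und #example!"$/%';

preg_match_all($regex, $text, $matches);

// $matches[0] holds the full matches including '#';
// $matches[1] holds the tag names without the leading '#'.
print_r($matches[1]);
```

Note that the {0,28} quantifier caps a tag at 30 characters, so longer tags would be cut off at that length.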
Upvotes: 0
Reputation: 91430
Use a Unicode property:
preg_match_all('/#(\p{L}+)/u', $string, $matches);
\p{L} stands for any letter in any language.
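For illustration, here is a sketch of how the function from the question could look with this pattern dropped in. A few things here are my assumptions rather than part of the question or this answer: the mbstring extension is available, mb_strtolower replaces strtolower so that non-ASCII letters are lowercased too, and the tags are joined with a space.

```php
<?php
// Sketch: returns the unique hashtags of a text, lowercased and space-separated.
// Assumptions: mbstring is installed; the space separator is an arbitrary choice.
function getHashtags(string $string): string
{
    $string = html_entity_decode($string, ENT_QUOTES, 'UTF-8');

    // '#' followed by one or more Unicode letters; the match stops at the
    // first non-letter. preg_match_all returns 0 (no match) or false (error).
    if (!preg_match_all('/#(\p{L}+)/u', $string, $matches)) {
        return '';
    }

    $hashtags = [];
    foreach ($matches[0] as $tag) {
        // mb_strtolower is UTF-8 aware; using the tag as an array key
        // removes duplicates while keeping first-seen order.
        $hashtags[mb_strtolower($tag, 'UTF-8')] = true;
    }

    return implode(' ', array_keys($hashtags));
}

echo getHashtags('#Example!"$/% and #Grüße, again #example');
// prints: #example #grüße
```

Because \p{L}+ matches letters only, a digit or underscore also ends the tag. If those should be allowed, a character class such as [\p{L}\p{N}_]+ could be used instead.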
Upvotes: 4