tobysas

Reputation: 308

How to get UTF-8 Hashtags without special chars in PHP

I'm having trouble extracting hashtags that contain only UTF-8 letters such as ä, ö, ü, ß (that is, letters used in words), without characters like !"§$%&/()+' etc.

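function getHashtags($string)
{
    $string = html_entity_decode($string, ENT_QUOTES, "utf-8");
    // Initialize up front so the return value is always defined.
    $hashtagLine = '';
    // Matches '#' followed by everything up to the next whitespace,
    // which is why trailing punctuation ends up in the hashtag.
    preg_match_all('/(\#)([^\s]+)/u', $string, $matches);
    if (!empty($matches[0])) {
        // Deduplicate the matched hashtags.
        $hashtagsArray = array_count_values($matches[0]);
        $hashtags = array_keys($hashtagsArray);
        foreach ($hashtags as $hashs) {
            $hashs = strtolower(trim($hashs));
            $hashtagLine .= $hashs;
        }
    }
    return $hashtagLine;
}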

This is my current solution: it receives a text string, extracts the hashtags from it, and returns them on one line. The problem is that it also picks up hashtags like #example!"$/% in full, instead of cutting them off just before the ! (giving #example).

Does anyone have a (regex) approach to cleanly extract Twitter-like UTF-8 hashtags from a string in PHP, without those unwanted punctuation characters?

Upvotes: 0

Views: 2103

Answers (2)

Mahdi

Reputation: 335

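You can use the regex below (shown with / delimiters and the u modifier, which PHP's preg_* functions need for a UTF-8 pattern):

$regex = '/(?:#)([\p{L}\p{N}_](?:(?:[\p{L}\p{N}_]|(?:\.(?!\.))){0,28}(?:[\p{L}\p{N}_]))?)/u';

It works similarly to Facebook and Instagram hashtags: a tag may contain letters, digits, and underscores, plus single dots in the middle, and is at most 30 characters long.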

Gist on GitHub: https://gist.github.com/khanzadimahdi/2ecfe1ba38860db132b4543ab5126926

You can test it using the links below:

https://regexr.com/4suqt

https://regex101.com/r/4SAxik/1

https://www.regexpal.com/?fam=113956
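
A minimal sketch of how the pattern above might be used with preg_match_all, assuming it is wrapped in / delimiters with the u modifier. The function name extractSocialHashtags is hypothetical and only for illustration; it is not part of the gist.

```php
<?php
// Hypothetical helper (the name is illustrative, not from the answer or the gist).
function extractSocialHashtags(string $text): array
{
    // The answer's pattern, wrapped in / delimiters with the u (UTF-8) modifier.
    $regex = '/(?:#)([\p{L}\p{N}_](?:(?:[\p{L}\p{N}_]|(?:\.(?!\.))){0,28}(?:[\p{L}\p{N}_]))?)/u';

    preg_match_all($regex, $text, $matches);

    // Group 1 holds the tag without the leading '#'.
    // Fall back to an empty list if matching failed (e.g. invalid UTF-8 input).
    return $matches[1] ?? [];
}
```

For example, extractSocialHashtags('Try #php_8.1 and #example!"$ now') should yield php_8.1 and example: the trailing punctuation is dropped because the last character of a tag must be a letter, digit, or underscore.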

Upvotes: 0

Toto

Reputation: 91430

Use a Unicode property:

preg_match_all('/#(\p{L}+)/u', $string, $matches);

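\p{L} matches any letter in any language, and the u modifier makes PHP treat both the pattern and the subject string as UTF-8. Digits and underscores are not matched, so the hashtag stops at the first non-letter character.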
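
As a rough sketch, the \p{L} approach could be combined with the rest of the question's logic like this. The function name extractHashtags is hypothetical, and the code assumes PHP 7.4+ with the mbstring extension available (for mb_strtolower); it returns an array rather than one concatenated line, which is a design choice made here for illustration.

```php
<?php
// Hypothetical helper (name and array return type are illustrative assumptions).
function extractHashtags(string $text): array
{
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // '#' followed by one or more Unicode letters; /u enables UTF-8 mode.
    preg_match_all('/#\p{L}+/u', $text, $matches);

    // Lowercase with mb_strtolower so non-ASCII letters such as 'Ö' are handled.
    // Fall back to an empty list if matching failed (e.g. invalid UTF-8 input).
    $tags = array_map(
        fn($tag) => mb_strtolower($tag, 'UTF-8'),
        $matches[0] ?? []
    );

    // Drop duplicates and reindex.
    return array_values(array_unique($tags));
}
```

Calling implode(' ', extractHashtags($string)) would give a single line of hashtags if that format is needed.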

Upvotes: 4
