Kenneth P.

Reputation: 1816

Extract any Unicode string occurrence within a string using preg_match

I have a string like this:

sample İletişim form:: aşağıdaki formu

What I'm aiming for is to extract the words that contain a Unicode/non-ASCII character, using PHP's preg_match or preg_match_all.

So I'm expecting a result of only two words, İletişim and aşağıdaki:

Array
(
    [0] => İletişim 
    [1] => aşağıdaki
)

I just can't come up with the regular expression, as I'm not good at them. Any help is welcome.

Thank you so much.

Upvotes: 0

Views: 305

Answers (2)

Lebugg

Reputation: 313

I think the beginning of the solution you want is here: How do I detect non-ASCII characters in a string?

Using preg_match_all(), you could do something like this:

preg_match_all('/[^\s]*[^\x20-\x7f]+[^\s]*/', $string, $matches);
print_r($matches);
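
As a rough sketch of the same idea with UTF-8 handling made explicit, the variant below uses the /u modifier so that the character class works on code points instead of raw bytes. The function name extractNonAsciiTokens is hypothetical, and the code assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper (not from the original answer): returns every
// whitespace-delimited token that contains at least one non-ASCII character.
// Assumes $text is valid UTF-8.
function extractNonAsciiTokens(string $text): array
{
    // \S*            any run of non-whitespace characters
    // [^\x00-\x7F]   one character outside the ASCII range
    // \S*            the rest of the token
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    if (preg_match_all('/\S*[^\x00-\x7F]\S*/u', $text, $matches) === false) {
        return []; // pattern error or invalid UTF-8 input
    }
    return $matches[0];
}

$tokens = extractNonAsciiTokens('sample İletişim form:: aşağıdaki formu');
print_r($tokens);
```

On the sample string from the question this should yield only İletişim and aşağıdaki. Note that a token keeps any punctuation attached to it, because tokens are delimited by whitespace only.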

Or, without preg_match, you can use mb_detect_encoding() in strict mode to test whether each word is pure ASCII. In your case, you could use it this way:

$matches = array_filter(explode(' ', $string), function($item) {
    return !mb_detect_encoding($item, 'ASCII', TRUE);
});
print_r($matches);

But this last approach is a bit of a hack ^^

Upvotes: 1

Toto

Reputation: 91430

You can use Unicode properties (\pL matches any Unicode letter, so this returns every word):

$string = 'sample İletişim form:: aşağıdaki formu';
preg_match_all("/(\pL+)/u", $string, $matches); 
print_r($matches);

output:

Array
(
    [0] => Array
        (
            [0] => sample
            [1] => İletişim
            [2] => form
            [3] => aşağıdaki
            [4] => formu
        )

    [1] => Array
        (
            [0] => sample
            [1] => İletişim
            [2] => form
            [3] => aşağıdaki
            [4] => formu
        )

)
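
Since \pL+ matches every word, ASCII or not, one possible follow-up step is to filter the matches down to the words that contain at least one non-ASCII character, which is what the question asks for. This is only a sketch: the variable name $nonAsciiWords is illustrative, and the input is assumed to be valid UTF-8.

```php
<?php
// Assumption: $string is the UTF-8 sample from the question.
$string = 'sample İletişim form:: aşağıdaki formu';

// Step 1: collect every run of Unicode letters.
preg_match_all('/\pL+/u', $string, $matches);

// Step 2: keep only the words with at least one byte outside ASCII.
// array_values() reindexes the result from 0 after filtering.
$nonAsciiWords = array_values(array_filter(
    $matches[0],
    function ($word) {
        return preg_match('/[^\x00-\x7F]/', $word) === 1;
    }
));

print_r($nonAsciiWords);
```

For the sample string this should leave just İletişim and aşağıdaki. Unlike a whitespace-based split, punctuation such as the "::" after "form" never ends up attached to a word, because \pL matches letters only.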

Upvotes: 1
