Aurimas

Reputation: 2493

How to url-encode only non-ASCII symbols of URL in PHP, but leave reserved symbols un-encoded?

I have a URL that looks like this (note the „ and “ symbols):

http://tinklarastis.omnitel.lt/kokius-aptarnavimo-kanalus-klientui-siulo-„omnitel“-1494

I receive it from the SimplePie parser, if that matters. If you open this URL in a browser and copy it back from the address bar, you get a URL with the non-ASCII symbols percent-encoded:

http://tinklarastis.omnitel.lt/kokius-aptarnavimo-kanalus-klientui-siulo-%E2%80%9Eomnitel%E2%80%9C-1494

I am trying to understand how I can mimic the same conversion in PHP. I cannot simply use urlencode() or rawurlencode(), as they encode both non-ASCII symbols and reserved symbols, while in my case the reserved symbols (/, ?, &, etc.) should stay as they are.

So far I have only seen solutions that split the URL into pieces at the reserved symbols and then apply urlencode() to each piece, but that feels hackish to me and I hope there's a more elegant solution. I have tried various combinations of iconv() and mb_convert_encoding(), with no success so far.
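To illustrate the problem with the built-in encoders, here is a minimal sketch of what rawurlencode() does to a whole URL. The sample URL is hypothetical (example.com), chosen only for the demonstration.

```php
<?php
// Hypothetical sample URL, used only for illustration.
$url = 'http://example.com/a b?x=1&y=2';

// rawurlencode() treats the whole string as a single component,
// so the reserved characters ":", "/", "?", "=" and "&" are escaped too.
$encoded = rawurlencode($url);

echo $encoded, "\n";
// prints: http%3A%2F%2Fexample.com%2Fa%20b%3Fx%3D1%26y%3D2
```

The result is no longer usable as a URL, which is why the encoding has to be limited to the characters that actually need it.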

Upvotes: 14

Views: 7245

Answers (5)

alexandru.asandei

Reputation: 266

I have a simple one-liner that I use to do in-place encoding of only the non-ASCII characters, using preg_replace_callback:

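$url = preg_replace_callback('/[^\x20-\x7f]/', function ($match) {
    // Without the /u modifier the pattern matches single bytes,
    // so each byte of a multi-byte character is encoded on its own.
    return urlencode($match[0]);
}, $url);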

Note that anonymous functions are only supported in PHP 5.3+.
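As a rough sketch, the one-liner can be wrapped in a reusable function and checked against the URL from the question. The name encode_non_ascii() is hypothetical, the type declarations assume PHP 7.0+, and the range is narrowed to \x20-\x7e here so that the DEL control character (\x7f) is encoded as well; spaces are still left alone, as in the one-liner.

```php
<?php
/**
 * Percent-encode every byte outside the range 0x20-0x7E, leaving reserved
 * characters such as "/", "?" and "&" untouched.
 * encode_non_ascii() is a hypothetical helper name.
 */
function encode_non_ascii(string $url): string
{
    // No /u modifier: the pattern matches single bytes, so each byte of a
    // multi-byte UTF-8 character is encoded separately (%E2, %80, %9E, ...).
    return preg_replace_callback('/[^\x20-\x7e]/', function (array $match): string {
        return rawurlencode($match[0]);
    }, $url);
}

// The URL from the question; the two quotation marks are written as raw
// UTF-8 bytes so the example does not depend on the file's encoding.
$url = "http://tinklarastis.omnitel.lt/kokius-aptarnavimo-kanalus-klientui-siulo-"
     . "\xE2\x80\x9Eomnitel\xE2\x80\x9C-1494";

echo encode_non_ascii($url), "\n";
// prints: http://tinklarastis.omnitel.lt/kokius-aptarnavimo-kanalus-klientui-siulo-%E2%80%9Eomnitel%E2%80%9C-1494
```

Because "%" itself is inside the untouched range, running the function on an already-encoded URL leaves it unchanged.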

Upvotes: 22

Юрий Светлов

Reputation: 1750

function cyrillicaToUrlencode($text) {
    // Percent-encode Cyrillic letters only; /u enables UTF-8 matching, /i makes it case-insensitive.
    return preg_replace_callback('/[а-яё]/ui', function ($matches) {
        return urlencode($matches[0]);
    }, $text);
}

echo cyrillicaToUrlencode("https://test.com/Москваёtext1Воронежtext2Москваёtext3yМоскваё___-Москваё");

This returns: https://test.com/%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D0%B0%D1%91text1%D0%92%D0%BE%D1%80%D0%BE%D0%BD%D0%B5%D0%B6text2%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D0%B0%D1%91text3y%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D0%B0%D1%91___-%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D0%B0%D1%91

Upvotes: 0

urmaul

Reputation: 7340

This function may help:

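function sanitizeUrl($url)
{
    // Characters that are left as they are, in addition to ASCII letters and digits.
    $chars = '$-_.+!*\'(),{}|\\^~[]`<>#%";/?:@&=';
    $pattern = '~[^a-z0-9' . preg_quote($chars, '~') . ']+~iu';

    // create_function() was removed in PHP 8.0, so use an anonymous function (PHP 5.3+).
    $callback = function ($matches) {
        return urlencode($matches[0]);
    };

    return preg_replace_callback($pattern, $callback, $url);
}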

Upvotes: 2

Aurimas

Reputation: 2493

After researching a bit, I came to the conclusion that there's no way to do this nicely in PHP (other languages, such as Python and Perl, do seem to have functions for exactly this use case). This is the function I came up with; it encodes only the path component of the URL:

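function url_path_encode($url) {
    $path = parse_url($url, PHP_URL_PATH);
    // No path (or an unparsable URL): nothing to encode.
    if (!is_string($path) || $path === '') return $url;
    // Avoid double encoding: a path that already contains '%' is returned untouched.
    if (strpos($path, '%') !== false) return $url;
    // rawurlencode() encodes a space as %20; urlencode() would emit '+', which is a literal plus sign in a path.
    $encoded_path = array_map('rawurlencode', explode('/', $path));
    // Note: str_replace() replaces every occurrence of $path within $url.
    return str_replace($path, implode('/', $encoded_path), $url);
}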
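A possible variation on the function above, offered only as a sketch: it splits the URL with a regular expression instead of parse_url() plus str_replace(), so the encoded path is put back in exactly one place, and it decodes each segment before encoding it so that a partially encoded path is not encoded twice. The name encode_path_only() is hypothetical, the sketch assumes an absolute URL of the form scheme://authority/path, and like the original it also encodes characters such as ":" or "," inside path segments.

```php
<?php
// encode_path_only() is a hypothetical name; assumes an absolute URL
// of the form scheme://authority/path?query#fragment.
function encode_path_only(string $url): string
{
    // Split into "scheme://authority", the path, and "?query#fragment".
    if (!preg_match('~^([a-z][a-z0-9+.-]*://[^/?#]*)([^?#]*)(.*)$~is', $url, $m)) {
        return $url; // not an absolute URL of the expected shape
    }
    $prefix = $m[1];
    $path   = $m[2];
    $rest   = $m[3];

    $segments = array_map(function (string $segment): string {
        // Decode first so an already-encoded segment is not encoded twice.
        return rawurlencode(rawurldecode($segment));
    }, explode('/', $path));

    // Only the path is rebuilt; the query string and fragment are kept verbatim.
    return $prefix . implode('/', $segments) . $rest;
}

echo encode_path_only('http://example.com/my file/a b?x=1&y=2'), "\n";
// prints: http://example.com/my%20file/a%20b?x=1&y=2
```

The query string is deliberately left alone here; encoding it would need its own rules, since "&" and "=" carry meaning there.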

Upvotes: 12

SamHennessy

Reputation: 4326

I think this will do what you want.

<?php

$string = 'http://tinklarastis.omnitel.lt/kokius-aptarnavimo-kanalus-klientui-siulo-„omnitel“-1494/?foo=bar&fizz=buzz';

var_dump(filter_var($string, FILTER_SANITIZE_STRING, FILTER_FLAG_ENCODE_HIGH));

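This produces the output below. Note that the non-ASCII bytes come out as HTML numeric entities (&#226; and so on) rather than percent-encoded sequences, and that FILTER_SANITIZE_STRING is deprecated as of PHP 8.1: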

$ php test.php
string(140) "http://tinklarastis.omnitel.lt/kokius-aptarnavimo-kanalus-klientui-siulo-&#226;&#128;&#158;omnitel&#226;&#128;&#156;-1494/?foo=bar&fizz=buzz"

Upvotes: 1
