Jon

Reputation: 2128

Is there a way to turn accented characters into their closest unaccented counterparts?

I have to convert a URL like "você-é-um-ás-da-aviação" to "voce-e-um-as-da-aviacao", to make it easier to read on the SERP.

I could do a common replacement, but I don't really like having to list each and every character, because I find it clunky and I want to keep language-specific characters out of the source code as much as I can.

Is it possible? Is it viable?

Upvotes: 3

Views: 377

Answers (4)

Arkh

Reputation: 8459

You could use a combination of iconv to transliterate your string to ASCII, then preg_replace to remove any unwanted characters that are left.

Something like:

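$string = "você-é-um-ás-da-aviação";
// Transliterate to ASCII; the exact result depends on the iconv implementation and locale.
$collated = iconv('UTF-8', 'ASCII//TRANSLIT', $string);
// Drop everything that is not a hyphen, letter, or digit (e.g. stray quote marks left by iconv).
$filtered = preg_replace('`[^-a-zA-Z0-9]`', '', $collated);
echo $filtered;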

Upvotes: 0

EPP

Reputation: 58

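function url_safe($string){
    $url = $string;
    // iconv's //TRANSLIT output depends on the current locale
    setlocale(LC_ALL, 'fr_FR'); // change to the one of your language
    $url = iconv("UTF-8", "ASCII//TRANSLIT", $url);
    // collapse every run of characters that are not letters, digits, or "_" into one hyphen
    $url = preg_replace('~[^\\pL0-9_]+~u', '-', $url);
    $url = trim($url, "-");
    $url = strtolower($url);
    return $url;
}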

Upvotes: 3

Borealid

Reputation: 98559

You could use the canonical decomposition mapping provided by the Unicode Consortium (the files in http://www.unicode.org/Public/UNIDATA/ ).
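A minimal sketch of the decomposition idea in PHP, assuming the intl extension is loaded so that the Normalizer class is available. The function name strip_accents is hypothetical. It decomposes each character into a base letter plus combining marks (NFD), then removes the marks:

```php
<?php
// Assumption: the intl extension is loaded (it provides Normalizer).
// strip_accents is a hypothetical helper name, not a built-in.
function strip_accents(string $text): string
{
    // NFD splits "é" into "e" followed by U+0301 (combining acute accent).
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $text; // input was not valid UTF-8; leave it untouched
    }
    // \p{Mn} matches nonspacing marks, i.e. the detached accents.
    return preg_replace('/\p{Mn}+/u', '', $decomposed) ?? $text;
}

echo strip_accents('você-é-um-ás-da-aviação'), "\n"; // voce-e-um-as-da-aviacao
```

This only handles characters that have a canonical decomposition; letters such as "ø" or "ß" have none and pass through unchanged.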

However, this is not as simple as you seem to think it is - believe it or not, there is a "kcal" symbol (U+3389) whose compatibility decomposition is four characters long.

You may also wish to consult the numeric equivalents tables there, as a "circled number seven" should probably map to the ASCII numeral seven, and so forth.
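As an illustration of these edge cases, here is a small sketch that uses compatibility normalization (NFKC) through PHP's Normalizer class rather than the numeric tables themselves. It assumes the intl extension is loaded, and swaps in NFKC as a stand-in for a hand-built lookup:

```php
<?php
// Assumption: the intl extension is loaded (it provides Normalizer).
$samples = [
    "\u{2466}", // CIRCLED DIGIT SEVEN
    "\u{3389}", // SQUARE KCAL
];

foreach ($samples as $sample) {
    // NFKC replaces each symbol with its compatibility equivalent.
    echo Normalizer::normalize($sample, Normalizer::FORM_KC), "\n";
}
// Prints "7" and then "kcal".
```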

I strongly advise against this strategy, however - you're butchering your text for little gain, and can't recover the original input once you've transformed it.

Upvotes: 2

gion_13

Reputation: 41533

I suggest you map every special character and its replacement into an array and then replace the text with a regex.
I know that you stated that you do not want to use a common replacement, but it's the only viable way to do so. You could filter the special characters out (by checking whether their character code falls outside the ASCII range), but that only removes them; it does not give you the correct replacement.
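A rough sketch of that mapping approach, assuming PHP 7.4+ (for the arrow function) and a UTF-8 source file. The map contents and the name replace_mapped are hypothetical; extend the map with whatever characters your content needs:

```php
<?php
// Hypothetical map covering only the characters in the example URL.
$map = [
    'á' => 'a', 'ã' => 'a', 'ç' => 'c', 'é' => 'e', 'ê' => 'e',
];

function replace_mapped(string $text, array $map): string
{
    // Build a character class from the map keys; /u makes the pattern UTF-8 aware.
    $pattern = '/[' . preg_quote(implode('', array_keys($map)), '/') . ']/u';

    // Look up each matched character in the map.
    return preg_replace_callback(
        $pattern,
        fn (array $m): string => $map[$m[0]],
        $text
    );
}

echo replace_mapped('você-é-um-ás-da-aviação', $map), "\n"; // voce-e-um-as-da-aviacao
```

Characters missing from the map are left as they are, so the map has to be kept in sync with the languages you support.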

Upvotes: 0
