Reputation: 415

Regex to ignore accents? PHP

Is there any way to write a regex that ignores accents?

For example:

preg_replace("/$word/i", "<b>$word</b>", $str);

The "i" modifier makes the match case-insensitive, but is there any way to match, for example, java with Jávã?

I tried making a copy of $str, stripping the accents from the copy, and finding the index of every occurrence there. But the indexes in the two strings seem to differ, even though the only change is the missing accents.

(I did some research, but all I could find was how to remove accents from a string.)

Upvotes: 16

Answers (5)

user267885

Reputation:

I don't think there is such a way. It would be locale-dependent, and you would probably want the "/u" modifier first to enable UTF-8 in pattern strings.

I would probably do something like this.

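function prepare($pattern)
{
   // Each class must also contain the plain letter, or "a" would stop matching "a".
   $replacements = array("a" => "[aáàäâã]",
                         "e" => "[eéèëê]"
                         // ... add further letters here
                         );
   // Escape regex metacharacters first; strtr() replaces in a single pass,
   // so text it has already inserted is never replaced again.
   return strtr(preg_quote($pattern, "/"), $replacements);
}

preg_replace("/(" . prepare($word) . ")/ui", "<b>\\1</b>", $str);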

In your case the indexes differed because, unless you used the mbstring functions, you were probably dealing with UTF-8, which uses more than one byte for each accented character.
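To make the byte-versus-character point concrete, here is a small sketch. It assumes the script is saved as UTF-8, that the accented letters are stored as single precomposed code points, and that the mbstring extension is loaded; the variable names are only illustrative.

```php
<?php
// Assumption: this file is UTF-8 encoded and mbstring is available.
$word = "Jávã";

// strlen() and strpos() count bytes; "á" and "ã" take two bytes each in UTF-8.
$byteLength = strlen($word);                            // 6
$bytePos    = strpos($word . "!", "!");                 // 6

// mb_strlen() and mb_strpos() count characters instead.
$charLength = mb_strlen($word, 'UTF-8');                // 4
$charPos    = mb_strpos($word . "!", "!", 0, 'UTF-8');  // 4

echo "$byteLength bytes, $charLength characters\n";
echo "'!' found at byte $bytePos, character $charPos\n";
```

An offset found in an accent-stripped copy with strpos() therefore points at the wrong byte in the original string as soon as an accented character precedes the match.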

Upvotes: 7

Niet the Dark Absol

Reputation: 324640

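Set an appropriate locale (fr_FR, for example) and use the strcoll() function to compare the strings. Note that strcoll() compares whole strings only, and whether accents are treated as insignificant depends on the locale's collation rules.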
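If the intl extension is available, its Collator class is one way to make the "compare ignoring accents" idea concrete. This is a swapped-in alternative to strcoll(), not what the answer literally proposes, and sameIgnoringAccents() is a hypothetical helper name. Like strcoll(), it compares whole strings and does not search inside a longer text.

```php
<?php
// Assumption: the intl extension is loaded (it provides the Collator class).
// sameIgnoringAccents() is a hypothetical helper, not a built-in function.
function sameIgnoringAccents(string $a, string $b, string $locale = 'fr_FR'): bool
{
    $collator = new Collator($locale);
    // PRIMARY strength compares base letters only, so accents and case are ignored.
    $collator->setStrength(Collator::PRIMARY);
    return $collator->compare($a, $b) === 0;
}

var_dump(sameIgnoringAccents('java', 'Jávã')); // bool(true)
var_dump(sameIgnoringAccents('java', 'lava')); // bool(false)
```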

Upvotes: 0

arun

Reputation: 3677

<?php

if (!function_exists('htmlspecialchars_decode')) {
    function htmlspecialchars_decode($text) {
        return str_replace(array('&lt;','&gt;','&quot;','&amp;'),array('<','>','"','&'),$text);
    }
}

function removeMarkings($text) 
{
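    // The caller passes ISO-8859-1 text (see utf8_decode() below); state the charset
    // explicitly because htmlentities() assumes UTF-8 by default since PHP 5.4.
    $text=htmlentities($text, ENT_COMPAT, 'ISO-8859-1');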
    // components (key+value = entity name, replace with key)
    $table1=array(
        'a'=>'grave|acute|circ|tilde|uml|ring',
        'ae'=>'lig',
        'c'=>'cedil',
        'e'=>'grave|acute|circ|uml',
        'i'=>'grave|acute|circ|uml',
        'n'=>'tilde',
        'o'=>'grave|acute|circ|tilde|uml|slash',
        's'=>'zlig', // maybe szlig=>ss would be more accurate?
        'u'=>'grave|acute|circ|uml',
        'y'=>'acute'
    );

    // direct (key = entity, replace with value)
    $table2=array(
        '&ETH;'=>'D',   // not sure about these character replacements
        '&eth;'=>'d',   // is an ð pronounced like a 'd'?
        '&THORN;'=>'B', // is a þ pronounced like a 'b'?
        '&thorn;'=>'b'  // don't think so, but the symbols looked like a d,b so...
    );

    foreach ($table1 as $k=>$v) $text=preg_replace("/&($k)($v);/i",'\1',$text);
    $text=str_replace(array_keys($table2),$table2,$text);    
    return htmlspecialchars_decode($text);
}

$text="Here are two words, one written normally and one with accents: java and jává. I searched with java and it found both occurrences (highlighted in this sentence): java and jává<br/>";
$find="java"; // the word to highlight; the search should highlight both java and jává
$text=utf8_decode($text);
$find=removeMarkings(utf8_decode($find)); $len=strlen($find);
preg_match_all('/\b'.preg_quote($find).'\b/i', removeMarkings($text), $matches, PREG_OFFSET_CAPTURE);
$start=0; $newtext="";
foreach ($matches[0] as $m) {
    $pos=$m[1];
    $newtext.=substr($text,$start,$pos-$start);
    $newtext.="<b>".substr($text,$pos,$len)."</b>";
    $start=$pos+$len;
}
$newtext.=substr($text,$start);
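// $newtext is ISO-8859-1 at this point; convert it back to UTF-8 before output.
echo "<blockquote>",mb_convert_encoding($newtext,'UTF-8','ISO-8859-1'),"</blockquote>";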

?>

I think something like this will help you. I got it from a forum, so just take a look.

Upvotes: 1

Spudley

Reputation: 168685

Regex isn't the tool for you here.

The answer you're looking for is the strtr() function.

This function replaces specified characters (or substrings) in a string, which is what you need here.

In your example, Jávã, you could use a strtr() call like this:

$replacements = array('á'=>'a', 'ã'=>'a');
$output = strtr("Jávã",$replacements);

$output will now contain Java.

Of course, you'll need a bigger $replacements array to cover all the characters you want to handle. See the manual page I linked for some examples of how people are using it.

Note that there isn't a simple blanket list of characters, because firstly it would be huge, and secondly, the same starting character may need to be translated differently in different contexts or languages.
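One way to combine strtr() with highlighting, sketched under a few assumptions: the text is UTF-8, mbstring is loaded, and every entry in the table maps one character to exactly one character. Under those assumptions character positions (not byte positions) line up between the original string and its accent-free copy, so a match found in the copy can be cut out of the original. The function name and the short table are hypothetical and deliberately incomplete.

```php
<?php
// Hypothetical helper: wraps every accent-insensitive match of $word in <b> tags.
// Assumptions: UTF-8 input, mbstring loaded, one-to-one character mappings only.
function highlightIgnoringAccents(string $str, string $word): string
{
    // Illustrative, incomplete table; extend it for the letters you need.
    $map = [
        'á' => 'a', 'à' => 'a', 'â' => 'a', 'ã' => 'a', 'ä' => 'a',
        'é' => 'e', 'è' => 'e', 'ê' => 'e', 'ë' => 'e',
    ];

    $plainStr  = strtr($str, $map);
    $plainWord = strtr($word, $map);
    $len = mb_strlen($plainWord, 'UTF-8');
    if ($len === 0) {
        return $str;
    }

    $out = '';
    $pos = 0;
    // Search the accent-free copy, but copy the pieces from the original string.
    while (($hit = mb_stripos($plainStr, $plainWord, $pos, 'UTF-8')) !== false) {
        $out .= mb_substr($str, $pos, $hit - $pos, 'UTF-8');
        $out .= '<b>' . mb_substr($str, $hit, $len, 'UTF-8') . '</b>';
        $pos = $hit + $len;
    }
    return $out . mb_substr($str, $pos, null, 'UTF-8');
}

echo highlightIgnoringAccents("I like Jávã and java", "java");
// prints: I like <b>Jávã</b> and <b>java</b>
```

A mapping such as 'æ' => 'ae' would break the alignment, because the copy would then be one character longer than the original at that point.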

Hope that helps.

Upvotes: 1

neevek

Reputation: 12138

Java and Jávã are different words, and regex has no native support for ignoring accents. You can, however, list every accented and unaccented variant you want to match in the pattern itself.

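Like preg_replace("/java|Jávã|jáva|javã/iu", "<b>$0</b>", $str); (the "u" modifier treats the pattern as UTF-8, and $0 puts back the text that actually matched instead of overwriting it with $word).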

Good luck!

Upvotes: 2
