Atif

Reputation: 10880

Convert string into slug with single-hyphen delimiters only

I would like to sanitize a string into a URL slug, so this is basically what I need:

  1. Everything must be removed except alphanumeric characters, spaces, and dashes.
  2. Spaces should be converted into dashes.

E.g.

This, is the URL!

must return

this-is-the-url

Upvotes: 25

Views: 33281

Answers (10)

mickmackusa

Reputation: 47944

The OP does not explicitly describe every attribute of a slug, but this is what I gather from the intent.

My interpretation of a perfect, valid, condensed slug aligns with this post: https://wordpress.stackexchange.com/questions/149191/slug-formatting-acceptable-characters#:~:text=However%2C%20we%20can%20summarise%20the,or%20end%20with%20a%20hyphen.

None of the earlier answers achieves this consistently (and I'm not even stretching the scope of the question to include multi-byte characters).

  1. convert all characters to lowercase
  2. replace every sequence of one or more non-alphanumeric characters with a single hyphen
  3. trim the leading and trailing hyphens from the string

I recommend the following one-liner which doesn't bother declaring single-use variables:

return trim(preg_replace('/[^a-z0-9]+/', '-', strtolower($string)), '-');
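As a minimal sketch, the one-liner can be wrapped in a function and run against the three demo inputs listed further down. The function name `slugify` is a placeholder of my own, not something the question or any library defines.

```php
<?php
// Hypothetical helper name; the body is the one-liner shown above.
function slugify(string $string): string
{
    // 1. lowercase, 2. turn each run of non-alphanumerics into one hyphen,
    // 3. trim hyphens from both ends.
    return trim(preg_replace('/[^a-z0-9]+/', '-', strtolower($string)), '-');
}

foreach (['This, is - - the URL!', 'Mork & Mindy', 'What the_underscore ?!?'] as $input) {
    echo slugify($input), PHP_EOL;
}
```

This prints this-is-the-url, mork-mindy and what-the-underscore, one per line.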

Not shown in my demo link, here is an attempt to better handle multibyte strings, though it doesn't quite accommodate as many scenarios as Casimir's answer.

return trim(preg_replace('/[^a-z0-9]+/', '-', strtolower(iconv('utf-8', 'ascii//translit', $string))), '-');

I have also prepared a demonstration which highlights what I consider to be inaccuracies in the other answers. (Demo)

'This, is - - the URL!' input
'this-is-the-url'       expected

'this-is-----the-url'   SilentGhost
'this-is-the-url'       mario
'This-is---the-URL'     Rooneyl
'This-is-the-URL'       AbhishekGoel
'This, is - - the URL!' HelloHack
'This, is - - the URL!' DenisMatafonov
'This,-is-----the-URL!' AdeelRazaAzeemi
'this-is-the-url'       mickmackusa

---
'Mork & Mindy'      input
'mork-mindy'        expected

'mork--mindy'       SilentGhost
'mork-mindy'        mario
'Mork--Mindy'       Rooneyl
'Mork-Mindy'        AbhishekGoel
'Mork &amp; Mindy'  HelloHack
'Mork & Mindy'      DenisMatafonov
'Mork-&-Mindy'      AdeelRazaAzeemi
'mork-mindy'        mickmackusa

---
'What the_underscore ?!?'   input
'what-the-underscore'       expected

'what-theunderscore'        SilentGhost
'what-the_underscore'       mario
'What-theunderscore-'       Rooneyl
'What-theunderscore-'       AbhishekGoel
'What the_underscore ?!?'   HelloHack
'What the_underscore ?!?'   DenisMatafonov
'What-the_underscore-?!?'   AdeelRazaAzeemi
'what-the-underscore'       mickmackusa

Upvotes: 3

Casimir et Hippolyte

Reputation: 89565

Using the intl Transliterator is a good option because it lets you handle complicated cases with a single set of rules. I added custom rules to illustrate how flexible it can be and how you can keep as much meaningful information as possible. Feel free to remove them and add your own rules.

$strings = [
    'This, is - - the URL!',
    'Holmes & Yoyo',
    'L’Œil de démon',
    'How to win 1000€?',
    '€, $ & other currency symbols',
    'Und die Katze fraß alle mäuse.',
    'Белите рози на София',
    'പോണ്ടിച്ചേരി സൂര്യനു കീഴിൽ',
];

$rules = <<<'RULES'
# Transliteration
:: Any-Latin ;   :: Latin-Ascii ;

# examples of custom replacements
'&' > ' and ' ;
[^0-9][01]? { € > ' euro' ;   € > ' euros' ;
[^0-9][01]? { '$' > ' dollar' ;   '$' > ' dollars' ;
:: Null ;

# slugify
[^[:alnum:]&[:ascii:]]+ > '-' ;
:: Lower ;

# trim
[$] { '-' > &Remove() ;
'-' } [$] > &Remove() ;
RULES;

$tsl = Transliterator::createFromRules($rules, Transliterator::FORWARD);
$results = array_map(fn($s) => $tsl->transliterate($s), $strings);
print_r($results);

demo

Unfortunately, the PHP manual says nothing about ICU transformations, but you can find information about them here.

Upvotes: 1

Hello Hack

Reputation: 109

    function isolate($data) {
        
        $data = trim($data);
        $data = stripslashes($data);
        $data = htmlspecialchars($data);
        
        return $data;
    }

Upvotes: 0

Adeel Raza Azeemi

Reputation: 793

The following will replace spaces with dashes.

$str = str_replace(' ', '-', $str);

Then the following statement will remove everything except alphanumeric characters and dashes (there are no spaces left, because the previous step replaced them with dashes).

// Char representation     0 -  9   A-   Z   a-   z  -    
$str = preg_replace('/[^\x30-\x39\x41-\x5A\x61-\x7A\x2D]/', '', $str);

Which is equivalent to

$str = preg_replace('/[^0-9A-Za-z-]+/', '', $str);
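Put together, the two steps might look like the sketch below. The wrapper name `keep_alnum_and_dashes` is hypothetical and only there to make the snippet runnable.

```php
<?php
// Hypothetical wrapper; the two statements are the ones shown above.
function keep_alnum_and_dashes(string $str): string
{
    $str = str_replace(' ', '-', $str);                 // every space becomes a dash
    return preg_replace('/[^0-9A-Za-z-]+/', '', $str);  // drop anything that is not alphanumeric or a dash
}

echo keep_alnum_and_dashes('This, is the URL!'), PHP_EOL; // This-is-the-URL
```

Note that the case is preserved and consecutive spaces turn into consecutive dashes, so wrapping the input in strtolower() and collapsing repeated dashes would still be needed to get exactly this-is-the-url.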

FYI: To remove every character outside the printable ASCII range, use

$str = preg_replace('/[^\x20-\x7E]/', '', $str);

\x20 is the hexadecimal code for a space, the first printable ASCII character, and \x7E is the tilde, the last one, according to Wikipedia: https://en.wikipedia.org/wiki/ASCII#Printable_characters

FYI: look into the Hex Column for the interval 20-7E

Printable characters Codes 20hex to 7Ehex, known as the printable characters, represent letters, digits, punctuation marks, and a few miscellaneous symbols. There are 95 printable characters in total.

Upvotes: -1

DjimOnDev

Reputation: 399

You should use the slugify package and not reinvent the wheel ;)

https://github.com/cocur/slugify

Upvotes: -1

Denis Matafonov

Reputation: 2802

All previous answers deal with URLs, but in case someone needs to sanitize a string for a login (for example) and keep it as text, here you go:

function sanitizeText($str) {
    $withSpecCharacters = htmlspecialchars($str);
    $splitted_str = str_split($str);
    $result = '';
    foreach ($splitted_str as $letter){
        if (strpos($withSpecCharacters, $letter) !== false) {
            $result .= $letter;
        }
    }
    return $result;
}

echo sanitizeText('ОРРииыфвсси ajvnsakjvnHB "&nvsp;\n" <script>alert()</script>');
//ОРРииыфвсси ajvnsakjvnHB &nvsp;\n scriptalert()/script
//No injections possible, as much information as possible is kept

Upvotes: 0

Abhishek Goel

Reputation: 19761

Try This

function clean($string) {
    $string = str_replace(' ', '-', $string); // Replaces all spaces with hyphens.
    $string = preg_replace('/[^A-Za-z0-9\-]/', '', $string); // Removes special chars.

    return preg_replace('/-+/', '-', $string); // Replaces multiple hyphens with single one.
}

Usage:

echo clean('a|"bc!@£de^&$f g');

Will output: abcdef-g

source : https://stackoverflow.com/a/14114419/2439715

Upvotes: 1

user1484291

Reputation: 11

This will do it in a Unix shell (I just tried it on macOS):

$ tr -cs A-Za-z '-' < infile.txt > outfile.txt

I got the idea from a blog post on More Shell, Less Egg

Upvotes: 1

Rooneyl

Reputation: 7902

First strip unwanted characters

$new_string = preg_replace("/[^a-zA-Z0-9\s]/", "", $string);

Then replace whitespace with hyphens

$url = preg_replace('/\s/', '-', $new_string);

Finally encode it ready for use

$new_url = urlencode($url);

Upvotes: 4

SilentGhost

Reputation: 319701

function slug($z){
    $z = strtolower($z);
    $z = preg_replace('/[^a-z0-9 -]+/', '', $z);
    $z = str_replace(' ', '-', $z);
    return trim($z, '-');
}
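One possible variation, which is not part of the answer as posted: if an input contains hyphens surrounded by spaces, the function above leaves several hyphens in a row. The sketch below collapses any run of spaces and hyphens into a single hyphen instead. The name `slug_collapsed` is made up for illustration.

```php
<?php
// Hypothetical variant of slug(): same steps, but runs of spaces/hyphens
// are collapsed into a single hyphen instead of being mapped one-to-one.
function slug_collapsed(string $z): string
{
    $z = strtolower($z);
    $z = preg_replace('/[^a-z0-9 -]+/', '', $z); // drop everything except a-z, 0-9, space, hyphen
    $z = preg_replace('/[ -]+/', '-', $z);       // any run of spaces and hyphens becomes one hyphen
    return trim($z, '-');
}

echo slug_collapsed('This, is - - the URL!'), PHP_EOL; // this-is-the-url
```

Characters such as the underscore are still removed rather than turned into a delimiter, so 'What the_underscore ?!?' becomes what-theunderscore.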

Upvotes: 52
