Reputation: 10880
I would like to sanitize a string in to a URL so this is what I basically need:
Eg.
This, is the URL!
must return
this-is-the-url
Upvotes: 25
Views: 33281
Reputation: 47944
The OP is not explicitly describing all of the attributes of a slug, but this is what I am gathering from the intent.
My interpretation of a perfect, valid, condensed slug aligns with this post: https://wordpress.stackexchange.com/questions/149191/slug-formatting-acceptable-characters#:~:text=However%2C%20we%20can%20summarise%20the,or%20end%20with%20a%20hyphen.
I find none of the earlier posted answers to achieve this consistently (and I'm not even stretching the scope of the question to include multi-byte characters).
I recommend the following one-liner which doesn't bother declaring single-use variables:
return trim(preg_replace('/[^a-z0-9]+/', '-', strtolower($string)), '-');
Not shown in my demo link, here is an attempt to better handle multibyte strings, though it doesn't quite accommodate as many scenarios as Casimir's answer.
return trim(preg_replace('/[^a-z0-9]+/', '-', strtolower(iconv('utf-8', 'ascii//translit', $string))), '-');
I have also prepared a demonstration which highlights what I consider to be inaccuracies in the other answers. (Demo)
'This, is - - the URL!' input
'this-is-the-url' expected
'this-is-----the-url' SilentGhost
'this-is-the-url' mario
'This-is---the-URL' Rooneyl
'This-is-the-URL' AbhishekGoel
'This, is - - the URL!' HelloHack
'This, is - - the URL!' DenisMatafonov
'This,-is-----the-URL!' AdeelRazaAzeemi
'this-is-the-url' mickmackusa
---
'Mork & Mindy' input
'mork-mindy' expected
'mork--mindy' SilentGhost
'mork-mindy' mario
'Mork--Mindy' Rooneyl
'Mork-Mindy' AbhishekGoel
'Mork & Mindy' HelloHack
'Mork & Mindy' DenisMatafonov
'Mork-&-Mindy' AdeelRazaAzeemi
'mork-mindy' mickmackusa
---
'What the_underscore ?!?' input
'what-the-underscore' expected
'what-theunderscore' SilentGhost
'what-the_underscore' mario
'What-theunderscore-' Rooneyl
'What-theunderscore-' AbhishekGoel
'What the_underscore ?!?' HelloHack
'What the_underscore ?!?' DenisMatafonov
'What-the_underscore-?!?' AdeelRazaAzeemi
'what-the-underscore' mickmackusa
Upvotes: 3
Reputation: 89565
Using intl transliterator is a good option because with it you can easily handle complicated cases with a single set of rules. I added custom rules to illustrate how it can be flexible and how you can keep a maximum of meaningful informations. Feel free to remove them and to add your own rules.
$strings = [
'This, is - - the URL!',
'Holmes & Yoyo',
'L’Œil de démon',
'How to win 1000€?',
'€, $ & other currency symbols',
'Und die Katze fraß alle mäuse.',
'Белите рози на София',
'പോണ്ടിച്ചേരി സൂര്യനു കീഴിൽ',
];
$rules = <<<'RULES'
# Transliteration
:: Any-Latin ; :: Latin-Ascii ;
# examples of custom replacements
'&' > ' and ' ;
[^0-9][01]? { € > ' euro' ; € > ' euros' ;
[^0-9][01]? { '$' > ' dollar' ; '$' > ' dollars' ;
:: Null ;
# slugify
[^[:alnum:]&[:ascii:]]+ > '-' ;
:: Lower ;
# trim
[$] { '-' > &Remove() ;
'-' } [$] > &Remove() ;
RULES;
$tsl = Transliterator::createFromRules($rules, Transliterator::FORWARD);
$results = array_map(fn($s) => $tsl->transliterate($s), $strings);
print_r($results);
Unfortunately, the PHP manual is totally empty about ICU transformations but you can find informations about them here.
Upvotes: 1
Reputation: 109
function isolate($data) {
$data = trim($data);
$data = stripslashes($data);
$data = htmlspecialchars($data);
return $data;
}
Upvotes: 0
Reputation: 793
The following will replace spaces with dashes.
$str = str_replace(' ', '-', $str);
Then the following statement will remove everything except alphanumeric characters and dashed. (didn't have spaces because in previous step we had replaced them with dashes.
// Char representation 0 - 9 A- Z a- z -
$str = preg_replace('/[^\x30-\x39\x41-\x5A\x61-\x7A\x2D]/', '', $str);
Which is equivalent to
$str = preg_replace('/[^0-9A-Za-z-]+/', '', $str);
FYI: To remove all special characters from a string use
$str = preg_replace('/[^\x20-\x7E]/', '', $str);
\x20 is hexadecimal for space that is start of Acsii charecter and \x7E is tilde. As accordingly to wikipedia https://en.wikipedia.org/wiki/ASCII#Printable_characters
FYI: look into the Hex Column for the interval 20-7E
Printable characters Codes 20hex to 7Ehex, known as the printable characters, represent letters, digits, punctuation marks, and a few miscellaneous symbols. There are 95 printable characters in total.
Upvotes: -1
Reputation: 399
You should use the slugify package and not reinvent the wheel ;)
https://github.com/cocur/slugify
Upvotes: -1
Reputation: 2802
All previous asnwers deal with url, but in case some one will need to sanitize string for login (e.g.) and keep it as text, here is you go:
function sanitizeText($str) {
$withSpecCharacters = htmlspecialchars($str);
$splitted_str = str_split($str);
$result = '';
foreach ($splitted_str as $letter){
if (strpos($withSpecCharacters, $letter) !== false) {
$result .= $letter;
}
}
return $result;
}
echo sanitizeText('ОРРииыфвсси ajvnsakjvnHB "&nvsp;\n" <script>alert()</script>');
//ОРРииыфвсси ajvnsakjvnHB &nvsp;\n scriptalert()/script
//No injections possible, all info at max keeped
Upvotes: 0
Reputation: 19761
Try This
function clean($string) {
$string = str_replace(' ', '-', $string); // Replaces all spaces with hyphens.
$string = preg_replace('/[^A-Za-z0-9\-]/', '', $string); // Removes special chars.
return preg_replace('/-+/', '-', $string); // Replaces multiple hyphens with single one.
}
Usage:
echo clean('a|"bc!@£de^&$f g');
Will output: abcdef-g
source : https://stackoverflow.com/a/14114419/2439715
Upvotes: 1
Reputation: 11
This will do it in a Unix shell (I just tried it on my MacOS):
$ tr -cs A-Za-z '-' < infile.txt > outfile.txt
I got the idea from a blog post on More Shell, Less Egg
Upvotes: 1
Reputation: 7902
First strip unwanted characters
$new_string = preg_replace("/[^a-zA-Z0-9\s]/", "", $string);
Then changes spaces for unserscores
$url = preg_replace('/\s/', '-', $new_string);
Finally encode it ready for use
$new_url = urlencode($url);
Upvotes: 4
Reputation: 319701
function slug($z){
$z = strtolower($z);
$z = preg_replace('/[^a-z0-9 -]+/', '', $z);
$z = str_replace(' ', '-', $z);
return trim($z, '-');
}
Upvotes: 52