Reputation: 4290
I have a website that supports multiple languages, and I am looking for a PHP function to strip the junk out of a string, whatever language it is in.
Example: the following is being inserted into my database. It is in Hindi, but the problem is the same for other languages.
कमबख़्त को गाली भी सलीक़े से नहीं दी जाती...\'
As you can see, I am getting the ...\' characters, which aren't wanted.
This doesn't cut it for multiple languages:
$newString = preg_replace('/[^a-z0-9]/i', ' ', $text);
I have also tried the following, which I don't really understand and which doesn't work either:
$newString = preg_replace('/^[\p{L}\p{M}\p{Nd}]{2,}$/u', ' ', $text);
I really just need to strip out everything that's not a letter or a number, i.e. the punctuation and symbols on the keyboard:
!@£$%^&*()_+=.<>/, etc etc
I am not sure whether the ...\' in the string is really what it appears to be, if that makes any sense. I ask because I also tried running:
$newString = str_replace("...\'", "", $text);
This is my first real dive into multi-language support.
Upvotes: 0
Views: 1098
Reputation: 2759
I managed to get them out using this:
$test = 'कमबख़्त को गाली भी सलीक़े से नहीं दी जाती...\\';
$test = preg_replace('@[^\x{0900}-\x{097F}]@u', '', $test);
Output
कमबख़्तकोगालीभीसलीक़ेसेनहींदीजाती
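The regular expression I used removes every character that is not in the Devanagari Unicode block (U+0900 to U+097F). Note that this also removes the spaces, as the output shows, and it only works for text written in Devanagari.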
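If you need the same thing for other scripts as well, a more general approach is to use Unicode property classes instead of a fixed range. The following is only a rough sketch and not something I have run against your data: strip_non_alnum() is a made-up helper name, and keeping whitespace and returning the input unchanged on invalid UTF-8 are my own assumptions about what you want.

```php
<?php
// Hypothetical helper: keep letters, combining marks, numbers and whitespace
// from any script, and drop everything else (punctuation, symbols, slashes).
function strip_non_alnum(string $text): string
{
    // \p{L} = letters, \p{M} = combining marks (needed for Devanagari vowel
    // signs, the virama and the nukta), \p{N} = numbers, \s = whitespace.
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $clean = preg_replace('/[^\p{L}\p{M}\p{N}\s]+/u', '', $text);

    // preg_replace() returns null on error, e.g. invalid UTF-8 input.
    // Assumption: in that case hand the text back unchanged.
    if ($clean === null) {
        return $text;
    }

    // Collapse runs of whitespace left behind and trim the ends.
    return trim(preg_replace('/\s+/u', ' ', $clean));
}

// The string from the question, ending in ...\'
$test = 'कमबख़्त को गाली भी सलीक़े से नहीं दी जाती...\\\'';
echo strip_non_alnum($test), "\n";
```

Unlike the range-based pattern, this keeps the spaces between words, and it should not care whether the input is Hindi, English or something else. The \p{M} class matters: without it the vowel signs would be stripped and the Hindi text would be mangled.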
Upvotes: 5