Reputation: 5663
Is there a simple regex that will catch all non-English characters? It would need to allow common punctuation and symbols, but no special characters such as Russian, Japanese, etc.
Looking for something to work in PHP.
Upvotes: 2
Views: 3224
Reputation: 9318
Use hex codes. For example, this strips out all non-ASCII characters as well as line endings, and replaces them with spaces. Space (\x20) is deliberately left out of the range so that consecutive runs of spaces and/or special chars are replaced with a single space.
$clean = trim(preg_replace('/[^\x21-\x7E]+/', ' ', $input));
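To make the behaviour concrete, here is a minimal sketch of the same call wrapped in a function. The name clean_ascii and the sample input are made up for illustration; only the pattern comes from the answer.

```php
<?php
// Hypothetical helper wrapping the one-liner above.
function clean_ascii(string $input): string
{
    // No /u modifier: the pattern is applied byte by byte, so every byte of a
    // multibyte UTF-8 character falls outside \x21-\x7E and joins the same run.
    return trim(preg_replace('/[^\x21-\x7E]+/', ' ', $input));
}

// "\xD0\x9F\xD1\x80" is the UTF-8 encoding of two Cyrillic letters.
echo clean_ascii("Hello  \xD0\x9F\xD1\x80,\nworld!"), PHP_EOL; // Hello , world!
```

Note that the two spaces plus the Cyrillic bytes collapse into one space, and the newline becomes another, which is why a space ends up before the comma.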
Upvotes: 0
Reputation: 34632
Since in your comment you're referring to addresses, they might contain digits too. So:
preg_replace('/[^[:alpha:][:punct:][:digit:]]/u', '', utf8_encode($input));
Should replace your unwanted characters. The [:alpha:] class will only work if your locale is set up correctly, though. If, for example, it's set to de_DE, not only "a" through "z" are regarded as letters, but also "exotics" like "ä", "ö", "è", and the like.
Also, since you don't want "Russian, Japanese, etc.", note the u modifier. The input has to be UTF-8 encoded, otherwise the match breaks and gives you wrong results.
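If depending on the locale is a concern, one alternative is to spell out the allowed range explicitly. The following is only a sketch, assuming "English" here means printable ASCII (\x20 through \x7E); the function name is_plain_ascii is hypothetical, and it validates rather than replaces.

```php
<?php
// Hypothetical helper: true when $s consists only of printable ASCII.
function is_plain_ascii(string $s): bool
{
    // \x20-\x7E covers space, digits, A-Z, a-z and ASCII punctuation.
    // \z anchors at the very end, so a trailing newline is rejected too.
    return preg_match('/^[\x20-\x7E]*\z/', $s) === 1;
}

var_dump(is_plain_ascii('221B Baker St., London')); // bool(true)
var_dump(is_plain_ascii("M\xC3\xBCnchen"));         // bool(false)
```

Because no POSIX class is involved, the result does not change with setlocale(). An empty string passes this check, so add a length test if that matters.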
Upvotes: 2
Reputation: 5663
This Q&A seemed to handle it: PHP Validate string characters are UK or US Keyboard characters
Upvotes: 0