Reputation: 878

Regex to check if a string contains letters or numbers

I'm trying to find the best way to check whether a string contains any "weird" UTF-8 characters.

Basically, I'm looking for something that guards against all the UTF-8 control symbols and unusual (non-standard) spaces that can be hidden in a string. By "hidden" I mean that printing the string to the screen would not display those characters: they would show up either as spaces or as blanks.

My previous approach was to return true if the string is one of those characters:

    function isUnusualString($string) {
        if ($string == "")
            return TRUE;

        $char = ord($string);

        if ($char < 33)
            return TRUE;
        if ($char > 8191 && $char < 8208)
            return TRUE;
        if ($char > 8231 && $char < 8240)
            return TRUE;

        switch ($char) {
            case 160:    // Non-Breaking Space
            case 8287:   // Medium Mathematical Space
                return TRUE;
        }
        return FALSE;
    }

However, this does not catch all cases, and I don't know why. My guess is that some of these characters can have a length greater than 1, or even a length of 0.
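One likely explanation, shown here as a minimal sketch that assumes PHP 7.2+ with the mbstring extension: `ord()` only looks at the first byte of the string, so it can never return more than 255 and the ranges above 8191 never match. `mb_ord()` returns the whole code point instead. The zero-width space used here is only an illustrative input.

```php
<?php
// Illustrative input: U+200B ZERO WIDTH SPACE, three bytes in UTF-8 (E2 80 8B).
$char = "\u{200B}";

echo strlen($char), "\n";              // 3    (bytes)
echo mb_strlen($char, 'UTF-8'), "\n";  // 1    (characters)
echo ord($char), "\n";                 // 226  (first byte only, 0xE2)
echo mb_ord($char, 'UTF-8'), "\n";     // 8203 (the actual code point)
```

So a single "hidden" character can be several bytes long, which would be consistent with the missed cases.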

So next I tried iterating over the characters of the string and checking whether any of them is one of the "hidden" characters.

For example, running the following code:

    function isUnusualUTF($string) {
        if ($string == "")
            return TRUE;

        $strlen = strlen($string);

        for ($i = 0; $i < $strlen; ++$i) {
            $char = ord($string[$i]);

            if ($char < 33)
                echo "char = ".$char." at index: ".$i." is < 33";

            if ($char > 8191 && $char < 8208)
                echo "char = ".$char." at index: ".$i." is > 8191 and < 8208 ";

            if ($char > 8231 && $char < 8240)
                echo "char = ".$char." at index: ".$i." is > 8231 and < 8240 ";

            switch ($char) {
                case 160:    // Non-Breaking Space
                case 8287:   // Medium Mathematical Space
                    echo "cases<br>"; //return TRUE;
                    break;
            }
        }
        return FALSE;
    }

    $string = "Unicode ";
    echo isUnusualUTF($string);

Outputs:

char = 32 at index: 7 is < 33
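The loop above walks the string byte by byte, because `strlen()` and `$string[$i]` both operate on bytes. Below is a sketch of the same checks done per code point, assuming valid UTF-8 input and the mbstring extension (PHP 7.2+). The helper name `findUnusualCodePoints` is hypothetical, and the ranges are copied unchanged from the code above.

```php
<?php
// Hypothetical helper: returns [character index => code point] for every
// character that falls in the ranges used by the checks above.
function findUnusualCodePoints(string $string): array
{
    $found = [];
    // Split into UTF-8 characters rather than bytes ("u" enables UTF-8 mode).
    // preg_split() returns false on invalid UTF-8; that case is treated as "no characters".
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    foreach ($chars as $i => $char) {
        $code = mb_ord($char, 'UTF-8');
        if ($code < 33
            || ($code > 8191 && $code < 8208)
            || ($code > 8231 && $code < 8240)
            || $code === 160      // Non-Breaking Space
            || $code === 8287) {  // Medium Mathematical Space
            $found[$i] = $code;
        }
    }
    return $found;
}

// "Uni", a zero-width space (U+200B), "code", then a trailing plain space.
print_r(findUnusualCodePoints("Uni\u{200B}code "));
```

For this input the zero-width space at character index 3 (code point 8203) is reported along with the trailing space at index 8, whereas the byte-wise loop only ever sees values from 0 to 255.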

I think the best way to do this would be with a regex that does the following:

if string has (numbers or letters or " " or other symbols 
               that can be printed and seen in the screen)
  return true
else
  return false
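A rough sketch of what such a regex could look like in PHP, using PCRE Unicode property classes. It reads "has" in the pseudocode as "consists only of", which is an assumption, and the function name `isPrintableOnly` is hypothetical. The allowed set (letters, numbers, punctuation, symbols, combining marks and the plain space) is one possible definition of "can be seen on the screen", not a definitive one.

```php
<?php
// Hypothetical helper: true only if every character is a letter (\p{L}),
// number (\p{N}), punctuation (\p{P}), symbol (\p{S}), combining mark (\p{M})
// or a plain ASCII space. Control, format and other space characters fail.
function isPrintableOnly(string $string): bool
{
    // "u" makes PCRE treat pattern and subject as UTF-8; \A and \z anchor
    // the match to the whole string. Invalid UTF-8 makes preg_match()
    // return false, which is reported here as "not printable".
    return preg_match('/\A[\p{L}\p{N}\p{P}\p{S}\p{M} ]+\z/u', $string) === 1;
}

var_dump(isPrintableOnly("Unicode 123!"));      // bool(true)
var_dump(isPrintableOnly("Uni\u{200B}code"));   // bool(false): zero-width space
var_dump(isPrintableOnly("a\u{00A0}b"));        // bool(false): non-breaking space
```

The empty string also yields false here because of the `+` quantifier, matching the earlier functions, which treat an empty string as unusual.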

Thank you

Upvotes: 0