Don Code

Reputation: 878

Regex to check if a string contains letters or numbers

I'm trying to find the best way to check whether a string contains any "weird" UTF-8 characters.

Basically, I'm looking for something that would guard against all the different UTF-8 control symbols and non-standard whitespace characters that can be hidden in a string. By "hidden" I mean that printing the string to the screen would not display those characters: they would show up either as spaces or as nothing at all.

My previous approach was to return true if the string is one of those characters:

  function isUnusualString($string) {
    if ($string == "")
      return TRUE;

    $char = ord($string);

    if ($char < 33)
      return TRUE;
    if ($char > 8191 && $char < 8208)
      return TRUE;
    if ($char > 8231 && $char < 8240)
      return TRUE;

    switch ($char) {
      case 160:    // Non-Breaking Space
      case 8287:   // Medium Mathematical Space
        return TRUE;
    }
    return FALSE;
  }

However, this does not catch all cases, and I don't know why. I'm assuming that some of these characters can have a length greater than 1, or a length of 0?

So next I tried iterating over the characters of the string and checking whether it contains any of the "hidden" characters.

For example, running the following code:

  function isUnusualUTF($string) {
    if ($string == "")
      return TRUE;

    $strlen = strlen($string);

    for ($i = 0; $i < $strlen; ++$i) {
      $char = ord($string[$i]);

      if ($char < 33)
        echo "char = ".$char." at index: ".$i." is < 33";

      if ($char > 8191 && $char < 8208)
        echo "char = ".$char." at index: ".$i." is > 8191 and < 8208 ";

      if ($char > 8231 && $char < 8240)
        echo "char = ".$char." at index: ".$i." is > 8231 and < 8240 ";

      switch ($char) {
        case 160:    // Non-Breaking Space
        case 8287:   // Medium Mathematical Space
          echo "cases<br>"; //return TRUE;
          break;
      }
    }
    return FALSE;
  }

  $string = "Unicode ";
  echo isUnusualUTF($string);

Outputs:

char = 32 at index: 7 is < 33

I think that the best way to do this would be with a regex that does:

if string has (numbers or letters or " " or other symbols 
               that can be printed and seen in the screen)
  return true
else
  return false

Thank you

Upvotes: 0

Views: 541

Answers (2)

Gi0rgi0s

Reputation: 1867

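In PHP you can use a regex to match characters by Unicode property using these escapes:

\p{xx} (matches a character that has property xx)

\P{xx} (matches a character that does not have property xx)

Here xx is the property you are looking for. These escapes are only available when the pattern uses the u modifier (UTF-8 mode).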

Here is a list of properties: http://php.net/manual/en/regexp.reference.unicode.php

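I think for your case you would want to fashion a character class like this:

[\p{xx}\p{yy}...etc]+

where "...etc" is symbolic and represents additional properties. This should match all of the characters you're looking for. Note the lowercase \p here: a class built from several \P{..} escapes would match any character that lacks at least one of the listed properties, which is almost every character.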

Here's a link to test your regex statement: http://www.phpliveregex.com/
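As a rough sketch of the property-based approach, the snippet below flags any character in the Unicode "Other" (\p{C}: control, format, unassigned, private use) or "Separator" (\p{Z}) categories, except the ordinary space. The function name containsHiddenCharacter and the choice of categories are assumptions made for illustration, not something taken from the question; adjust the class to whatever you consider "hidden".

```php
<?php
// Hypothetical helper: true when $string contains at least one character
// that is normally invisible, other than the ordinary space (U+0020).
function containsHiddenCharacter(string $string): bool
{
    // \p{C}  = control, format, unassigned, private-use and surrogate code points
    // \p{Z}  = space, line and paragraph separators
    // (?! )  = negative lookahead that lets the ordinary space through
    // u      = treat pattern and subject as UTF-8
    $result = preg_match('/(?! )[\p{C}\p{Z}]/u', $string);

    // preg_match() returns false on error, for example when the subject
    // is not valid UTF-8. Treat that as "unusual" as well.
    if ($result === false) {
        return true;
    }

    return $result === 1;
}

var_dump(containsHiddenCharacter("Unicode text"));         // bool(false)
var_dump(containsHiddenCharacter("Unicode\u{200B}text"));  // bool(true), zero width space
var_dump(containsHiddenCharacter("Unicode\u{00A0}text"));  // bool(true), no-break space
```

Tabs and newlines fall under \p{C}, so they are flagged too. If you want to allow them, exclude them in the lookahead, e.g. (?![ \t\n]).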

Upvotes: 1

000

Reputation: 27227

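Use the multibyte versions of those functions. strlen() and ord() work on single bytes, so ord() can never return a value above 255 and the checks against 8191, 8231, and so on never fire: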

mb_strlen: https://www.php.net/mb_strlen

Although I believe this method may do exactly what you want: https://www.php.net/manual/en/function.mb-check-encoding.php
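To illustrate, here is a minimal sketch of the question's range checks rewritten with multibyte functions. It assumes PHP 7.4+ with the mbstring extension and UTF-8 input; the function name findUnusualCodePoints is made up for this example, and it uses < 32 rather than the question's < 33 so that the ordinary space is allowed.

```php
<?php
// Byte-oriented vs. code-point-oriented functions on EM SPACE (U+2003):
$emSpace = "\u{2003}";
var_dump(strlen($emSpace));             // int(3)    three UTF-8 bytes
var_dump(mb_strlen($emSpace, 'UTF-8')); // int(1)    one character
var_dump(ord($emSpace));                // int(226)  first byte only
var_dump(mb_ord($emSpace, 'UTF-8'));    // int(8195) the actual code point

// Hypothetical helper: returns [character index => code point] for every
// character that falls into the ranges listed in the question.
function findUnusualCodePoints(string $string): array
{
    // Code points are only meaningful if the input is valid UTF-8.
    if (!mb_check_encoding($string, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    $found = [];
    foreach (mb_str_split($string, 1, 'UTF-8') as $index => $char) {
        $codePoint = mb_ord($char, 'UTF-8');

        $isUnusual = $codePoint < 32                        // C0 controls (space allowed)
            || ($codePoint > 8191 && $codePoint < 8208)     // U+2000..U+200F
            || ($codePoint > 8231 && $codePoint < 8240)     // U+2028..U+202F
            || $codePoint === 160                           // Non-Breaking Space
            || $codePoint === 8287;                         // Medium Mathematical Space

        if ($isUnusual) {
            $found[$index] = $codePoint;
        }
    }

    return $found;
}

var_dump(findUnusualCodePoints("Unicode "));   // empty array
var_dump(findUnusualCodePoints("a\u{2003}b")); // [1 => 8195]
```

Note that mb_check_encoding() is used here only as a guard: it verifies that the bytes form valid UTF-8 and does not by itself detect invisible characters.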

Upvotes: 0
