Mala

Reputation: 14823

PHP regular expression to filter out junk

So I have an interesting problem: I have a string, and for the most part I know what to expect:

http://www.someurl.com/st=????????

Except in this case, the ?'s are either upper case letters or numbers. The problem is, the string has garbage mixed in: the string is broken up into 5 or 6 pieces, and in between there's lots of junk: unprintable characters, foreign characters, as well as plain old normal characters. In short, stuff that's apt to look like this: Nyþ=mî;ëMÝ×nüqÏ

Usually the last 8 characters (the ?'s) are together right at the end, so at the moment I just have PHP grab the last 8 chars and hope for the best. Occasionally, that doesn't work, so I need a more robust solution.

The problem is technically unsolvable, but I think the best solution is to grab characters from the end of the string while they are upper case or numeric. If I get 8 or more, assume that is correct. Otherwise, find the st= and grab characters going forward as many as I need to fill up the 8 character quota. Is there a regex way to do this or will I need to roll up my sleeves and go nested-loop style?
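The two-step fallback described above can be sketched without nested loops. This is only a rough sketch under stated assumptions: the function name `extract_code` is hypothetical, the marker `st=` is assumed to survive intact (garbage could in principle split it), and any upper-case letter or digit in the garbage between `st=` and the tail will be picked up by mistake.

```php
<?php
// Hypothetical helper: the name and signature are illustrative only.
function extract_code(string $input, int $length = 8): ?string
{
    // Step 1: take the run of upper-case letters/digits at the very end.
    // No /u modifier, so invalid UTF-8 in the garbage does not break matching.
    preg_match('/[A-Z0-9]*$/D', $input, $m);
    $tail = $m[0];
    if (strlen($tail) >= $length) {
        return substr($tail, -$length);
    }

    // Step 2: otherwise find the last "st=" and collect matching
    // characters going forward until the quota is filled.
    $pos = strrpos($input, 'st=');
    if ($pos === false) {
        return null; // assumption: give up if the marker itself is damaged
    }
    $middleLength = strlen($input) - $pos - 3 - strlen($tail);
    $middle = (string) substr($input, $pos + 3, $middleLength);
    preg_match_all('/[A-Z0-9]/', $middle, $chars);

    $needed = $length - strlen($tail);
    if (count($chars[0]) < $needed) {
        return null; // not enough candidate characters
    }
    return implode('', array_slice($chars[0], 0, $needed)) . $tail;
}
```

For an input such as `http://www.someurl.com/st=AB1` + junk + `2CD34`, step 1 finds only five trailing characters, so step 2 takes `AB1` from right after `st=` to fill the quota.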

update:

To clear up some confusion, I get an input string that's like this:

[garbage]http:/[garbage]/somewe[garbage]bsite.co[garbage]m/something=[garbage]????????

except the garbage is in unpredictable locations in the string (except the end is never garbage), and has unpredictable length (at least, I haven't been able to find a pattern in either). Usually the ?s are all together, hence me just grabbing the last 8 chars, but sometimes they aren't, which results in some missing data and returned garbage.

Upvotes: 0

Views: 2512

Answers (4)

Manoranjan

Reputation: 1

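You can use this regular expression to check whether the string contains any of these special characters:

if (preg_match('/[\'^£$%&*()}{@#~?><,|=_+¬-]/', $string) === 1) {
    // $string contains at least one of the listed characters
}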

Upvotes: 0

Dereleased

Reputation: 10087

$var = '†http://þ=www.ex;üßample-website.î;ëcomÝ×ü/joy_hÏere.html'; // test case

$clean = join(
    array_filter(
        str_split($var, 1),
        function ($char) {
            return (
                array_key_exists(
                    $char,
                    array_flip(array_merge(
                        range('A','Z'),
                        range('a','z'),
                        range((string)'0',(string)'9'),
                        array(':','.','/','-','_')
                    ))
                )
            );
        }
    )
);

Hah, that was a joke. Here's a regex for you:

$clean = preg_replace('/[^A-Za-z0-9:.\/_-]/','',$var);
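To apply the same idea to the question's `st=` format, one possible adaptation is sketched below. It is not part of the answer above: the function name `extract_after_cleaning` is hypothetical, `=` has been added to the whitelist so that `st=` survives the cleanup, and it assumes the garbage consists only of characters outside the whitelist (ordinary letters or digits in the garbage would still slip through).

```php
<?php
// Hypothetical helper: illustrative only.
function extract_after_cleaning(string $input): ?string
{
    // Same whitelist as the regex above, plus "=" so the marker is kept.
    $clean = preg_replace('/[^A-Za-z0-9:.\/_=-]/', '', $input);

    // Take exactly eight upper-case letters or digits right after "st=".
    if (preg_match('/st=([A-Z0-9]{8})/', $clean, $m) === 1) {
        return $m[1];
    }
    return null;
}
```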

Upvotes: 6

Sparr

Reputation: 7712

As stated, the problem is unsolvable. If the garbage can contain "plain old normal characters", and the garbage can fall at the end of the string, then you cannot know whether the target string from this sample is "ABCDEFGH" or "BCDEFGHI":

__http:/____/somewe___bsite.co____m/something=__ABCDEFGHI__

Upvotes: 1

intgr

Reputation: 20466

What do these values represent? If you want to retain all of it, just without having to deal with garbage in your database, maybe you should hex-encode it using bin2hex().
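A minimal sketch of that approach, using a made-up sample value: `bin2hex()` turns every byte into two hex digits, so the stored value contains only `0-9` and `a-f`, and `hex2bin()` restores the original bytes.

```php
<?php
// Made-up sample: "st=AB", one junk byte (0xFE), then "12".
$raw = "st=AB\xFE12";

// Hex-encode before storing; the result is plain ASCII.
$hex = bin2hex($raw);

// Decode again when the original bytes are needed.
$restored = hex2bin($hex);
```

This keeps the junk rather than filtering it, so it only helps with storage, not with recovering the 8-character code.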

Upvotes: 0
