lil_bugga

Reputation: 91

URL Validation/Sanitization with Regular Expressions

I'm a little out of my depth here, but I believe I'm now on the right track. I want to take user-supplied URLs and store them in a database so that the links can be shown on a user profile page.

The links I'm hoping users will supply are for social media sites such as Facebook. While looking for a way to store user-supplied URLs safely, I found this page: http://electrokami.com/coding/use-php-to-format-and-validate-a-url-with-these-easy-functions/. The code works but seems to strip nearly everything. If I pass in "www.example.com/user.php?u=borris", it just returns "example.com is valid".

Then I found out about regular expressions and found this line of code

/(?:https?:\/\/)?(?:www\.)?facebook\.com\/(?:(?:\w)*#!\/)?(?:pages\/)?(?:[\w\-]*\/)*([\w\-\.]*)/

from https://gist.github.com/marcgg/733592, along with another Stack Overflow post, "Check if a string contains a url and get contents of url php".

I tried to merge the two pieces of code so that I'd end up with something that validates a link to a Facebook profile or page. I don't want to fetch profile info, pictures, etc. My code isn't right either, so rather than getting deeper into things I don't fully understand yet, I thought it best to ask for help.

Below is the code I mashed together, which gave me the error "Warning: preg_match_all() [function.preg-match-all]: Compilation failed: unmatched parentheses at offset 29... on line 9"

<?php
// get url to check from the page parameter 'url'
// or fall back to a default test url
$text = isset($_GET['url']) 
? $_GET['url'] 
: "http://www.vwrx-project.co.uk/user.php?u=borris";

$reg_exurl = "/(?:http|https|ftp|ftps)?:\/\/)?(?:www\.)?facebook\.com\/(?:(?:\w)*#!\/)?(?:pages\/)?(?:[\w\-]*\/)*([\w\-\.]*)/";
preg_match_all($reg_exurl, $text, $matches);
$usedPatterns = array();
$url = '';
foreach($matches[0] as $pattern){
    if(!array_key_exists($pattern, $usedPatterns)){
        $usedPatterns[$pattern] = true;
        $url = $pattern;
    }
}

?>

--------------------------------------------------------- Additional ------------------------------------------------------------

Today I took a fresh look at the answer Dave provided and felt I could work with it. It makes more sense to me from a code perspective, as I can follow the process.

I now have a system I'm partly happy with. If I supply the link http://www.facebook.com/#!/lilbugga, which is a typical Facebook link (the kind you get when clicking your username or profile picture from your wall), I get the result http://www.facebook.com/lilbugga, which shows as valid.

What it can't handle is a Facebook link that isn't in the vanity/SEO-friendly format, such as https://www.facebook.com/profile.php?id=4. If I allow my code to accept ? and =, then I suspect I'm leaving my website/database open to attack, which I don't want.

What's the best option now? This is the code I have:

<?php   
$dirty_url = "http://www.facebook.com/profile.php?id=4";  //user supplied link

// clean url, keeping only alphanumerics and : / . (needed to strip facebook's /#!/ link format)
$clean_url = preg_replace('#[^a-z0-9:/.]#i', '', $dirty_url); 

$parsed_url = parse_url($clean_url); // parse url to get a breakdown of its components

$safe_host = $parsed_url['host']; // safe host direct from parse_url

// str_replace to switch any // to a / inside the returned path - required due to preg_replace process above
echo $safe_path = str_replace("//", "/", ($parsed_url['path']));

if ($parsed_url['host'] == 'www.facebook.com') {
  echo "<a href=\"http://$safe_host$safe_path\" alt=\"facebook\" target=\"_new\">Facebook</a>";
} else {
    echo " :( invalid url";
}
?>

Upvotes: 1

Views: 2875

Answers (2)

Braj

Reputation: 46841

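I took the following regex pattern from another source.

Use the captured groups: group 1 is the host, group 2 is the path with its query string, and group 3 is the query string alone.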

(?:http|https|ftp|ftps(?:\/\/)?)?(?:www.|[-;:&=\+\$,\w]+@)([A-Za-z0-9.-]+)((?:\/[\+~%\/.\w_-]*)?\??((?:[-\+=&;%@.\w_]*)#?(?:[\w]*)?))

Input:

www.example.com/user.php?u=borris
http://www.vwrx-project.co.uk/user.php?u=borris

Output:

MATCH 1
1.  [4-15]  `example.com`
2.  [15-33] `/user.php?u=borris`
3.  [25-33] `u=borris`
MATCH 2
1.  [45-63] `vwrx-project.co.uk`
2.  [63-81] `/user.php?u=borris`
3.  [73-81] `u=borris`
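
For the Facebook-specific pattern in the question, the "unmatched parentheses" error appears to come from the scheme part: `(?:http|https|ftp|ftps)?:\/\/)?` closes a group that was never opened. Below is a minimal sketch of one possible fix, assuming the intent was an optional scheme followed by `://`. The function name `facebook_name` is hypothetical, `~` is used as the delimiter only to avoid escaping every `/`, and the return type hint assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: returns the last path segment of a Facebook link
// found in $text, or null when no Facebook link is present.
function facebook_name(string $text): ?string
{
    // The scheme alternatives and "://" now sit inside ONE optional group,
    // so every ")" has a matching "(".
    $pattern = '~(?:(?:https?|ftps?)://)?(?:www\.)?facebook\.com/'
             . '(?:\w*#!/)?(?:pages/)?(?:[\w\-]*/)*([\w\-.]*)~';

    if (preg_match($pattern, $text, $matches) === 1) {
        return $matches[1]; // the captured user or page name
    }
    return null;
}

var_dump(facebook_name('http://www.facebook.com/#!/lilbugga'));
var_dump(facebook_name('http://www.vwrx-project.co.uk/user.php?u=borris'));
```

Note that the pattern is not anchored, so it extracts a name from anywhere in the string rather than proving that the whole string is a Facebook URL. For validation, checking the host separately is likely the safer route.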

Upvotes: 0

dave

Reputation: 64657

Not sure exactly what you are trying to accomplish, but it sounds like you could use parse_url for this:

<?php
   // fall back to an example URL when no 'url' parameter is supplied
   $url = isset($_GET['url']) ? $_GET['url'] : "http://www.vwrx-project.co.uk/user.php?u=borris";
   $parsed_url = parse_url($url);
   print_r($parsed_url);
   /*
     Array
     (
         [scheme] => http
         [host] => www.vwrx-project.co.uk
         [path] => /user.php
         [query] => u=borris
     )
   */
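   // parse_url() returns false on seriously malformed URLs and omits missing components
   if (isset($parsed_url['host']) && $parsed_url['host'] === 'www.facebook.com') {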
      //do stuff
   }
?>
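
The same parse_url approach can probably be extended to accept numeric profile links such as https://www.facebook.com/profile.php?id=4 without allowing arbitrary query strings. The sketch below is one way to do it, not a definitive implementation: the function name `clean_facebook_url` is hypothetical, the rules for what counts as a vanity name are an assumption, and it assumes PHP 7.1 or later. The idea is to rebuild the URL from validated parts instead of stripping characters from the input.

```php
<?php
// Hypothetical helper: returns a rebuilt Facebook URL, or null if the
// input is not in one of the two accepted shapes.
function clean_facebook_url(string $dirty_url): ?string
{
    // Older Facebook links look like /#!/name; fold them into a plain path
    $dirty_url = str_replace('/#!/', '/', trim($dirty_url));

    $parts = parse_url($dirty_url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null;
    }

    $scheme = strtolower($parts['scheme']);
    if (!in_array($scheme, ['http', 'https'], true)) {
        return null; // rejects javascript:, data:, etc.
    }
    if (strtolower($parts['host']) !== 'www.facebook.com') {
        return null;
    }

    $base = $scheme . '://www.facebook.com';
    $path = $parts['path'] ?? '/';

    // Numeric profile link: accept only an all-digit "id" parameter
    if ($path === '/profile.php') {
        parse_str($parts['query'] ?? '', $query);
        if (isset($query['id']) && is_string($query['id']) && ctype_digit($query['id'])) {
            return $base . '/profile.php?id=' . $query['id'];
        }
        return null;
    }

    // Vanity link: a single segment of letters, digits, dots and hyphens
    // (this character set is an assumption, adjust as needed)
    if (preg_match('~^/([A-Za-z0-9.\-]+)/?\z~', $path, $m) === 1) {
        return $base . '/' . $m[1];
    }
    return null;
}

var_dump(clean_facebook_url('http://www.facebook.com/#!/lilbugga'));
var_dump(clean_facebook_url('https://www.facebook.com/profile.php?id=4'));
var_dump(clean_facebook_url('http://www.vwrx-project.co.uk/user.php?u=borris'));
```

Allowing ? and = is not what exposes the database. Storing the value through a parameterized query (for example a PDO prepared statement) is what guards against SQL injection, and passing the stored URL through htmlspecialchars() when printing it into the link guards against markup injection.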

Upvotes: 1
