Reputation: 2976

How to filter URLs that contain white space with preg match?

I am parsing a text that contains several links. Some of them contain whitespace but end with a file extension. My current pattern is:

preg_match_all('#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#', $links, $match);

This works the same way:

preg_match_all('/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/', $links, $match);

I don't know much about the patterns and didn't find a good tutorial that explains the meaning of all possible patterns and shows examples.

How could I filter a URL like this: http://my-url.com/my doc.doc or even one with more spaces, such as http://my-url.com/my doc with more white spaces.doc

The \s in those preg_match_all calls stands for a whitespace character. But how could I check whether a file extension follows one or more spaces?

Is it possible?

Upvotes: 0

Answers (6)

OfirH

Reputation: 657

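I think this should work. It splits the URL at every space and joins the pieces again, which removes the spaces:

$url = '...';
$url_new = '';
// Split at spaces; consecutive spaces produce empty pieces.
$array = explode(' ', $url);

foreach ($array as $val) {
    // Skip the empty pieces and append the rest.
    if ($val !== '') {
        $url_new .= $val;
    }
}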

Upvotes: 0

Class

Reputation: 3160

This might be what you are looking for. It uses urlencode:

$file = "my doc with more white spaces.doc";
echo "http://my-url.com/" . urlencode($file);

which produces:

http://my-url.com/my+doc+with+more+white+spaces.doc

or, with rawurlencode in place of urlencode, it produces:

http://my-url.com/my%20doc%20with%20more%20white%20spaces.doc
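The two encodings can be compared side by side. This is only a minimal sketch; the variable names $plusEncoded and $percentEncoded are made up for illustration. urlencode follows the form-encoding convention where a space becomes +, which is meant for query strings, while rawurlencode emits %20, which is the form a path segment expects.

```php
<?php
$file = "my doc with more white spaces.doc";

// urlencode: spaces become "+", as in form-encoded query strings.
$plusEncoded = "http://my-url.com/" . urlencode($file);

// rawurlencode: spaces become "%20", which is what a URL path expects.
$percentEncoded = "http://my-url.com/" . rawurlencode($file);

echo $plusEncoded, PHP_EOL;
echo $percentEncoded, PHP_EOL;
```

For a file name that ends up in the path of the URL, the rawurlencode form is usually the safer choice.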

EDIT: Something like the following might help to parse your urls with parse_url

$url = 'http://my-url.com/my doc with more white spaces.doc';
$purl = parse_url($url);
$rurl = "";
if(isset($purl['scheme'])){
    $rurl .= $purl['scheme'] . "://";
}
if(isset($purl['host'], $purl['path'])){
    $rurl .= $purl['host'] . rawurlencode($purl['path']);
}
if($rurl === ""){
    $rurl = $url;#error parsing error/invalid url?
}

Because rawurlencode also encodes the slashes in the path, encode each path segment separately to keep subdirectories intact:

$purl['path'] = implode('/', array_map(function($value){return rawurlencode($value);}, explode('/', $purl['path'])));
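Putting the pieces of this answer together could look roughly like the sketch below. The function name encodeUrlPath is hypothetical, and the sketch assumes a plain scheme://host/path URL: port, query string and fragment are simply dropped, and a path that is already percent-encoded would be encoded a second time.

```php
<?php
// Hypothetical helper: percent-encode every path segment of a URL.
// Assumes scheme://host/path only; port, query and fragment are ignored.
function encodeUrlPath(string $url): string
{
    $parts = parse_url($url);

    // Without scheme and host there is nothing sensible to rebuild.
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $url;
    }

    // Encode each segment on its own so the "/" separators survive.
    $path = $parts['path'] ?? '';
    $segments = array_map('rawurlencode', explode('/', $path));

    return $parts['scheme'] . '://' . $parts['host'] . implode('/', $segments);
}

echo encodeUrlPath('http://my-url.com/sub dir/my doc.doc'), PHP_EOL;
```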

Upvotes: 2

Amit Joki

Reputation: 59262

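I don't know much about PHP, but this regex

(http|ftp)(s)?://([\w-]+\.)+[\w-]+(/[\w ./?%&=-]*)?

should match URLs even when the path contains spaces. The hyphen sits at the end of the last character class because PCRE2 (PHP 7.3 and later) rejects a hyphen placed directly after \w in the middle of a class. Because a space is allowed in the path, the match also runs on into any plain text that follows the URL, so it works best when the links are delimited, for example by quotation marks.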
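To try the idea in PHP, the regex needs delimiters. The sketch below wraps it in # and places the hyphen at the end of the last character class. The constant names, the helper findUrls and the second pattern are assumptions added for illustration: the second pattern is not from this answer, it only accepts spaces when the match eventually ends in one of a few file extensions, and that extension list is an assumption too.

```php
<?php
// Pattern A: the regex from this answer, wrapped in # delimiters for preg_*.
const URL_WITH_SPACES = '#(?:http|ftp)s?://(?:[\w-]+\.)+[\w-]+(?:/[\w ./?%&=-]*)?#';

// Pattern B (assumption, not from the answer): spaces are accepted only
// on the way to a known file extension; the lazy group stops at the first one.
const URL_WITH_FILE_EXTENSION =
    '#\bhttps?://[^\s()<>"]+(?:[ ]+[^\s()<>"]+)*?\.(?:docx?|pdf|jpe?g|png)\b#i';

// Hypothetical helper: return all full matches of $pattern in $text.
function findUrls(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

$quoted   = 'See "http://my-url.com/my doc.doc" for details';
$unquoted = 'Get http://my-url.com/my doc.doc now';

print_r(findUrls(URL_WITH_SPACES, $quoted));           // the closing quote ends the match
print_r(findUrls(URL_WITH_SPACES, $unquoted));         // the match runs on into " now"
print_r(findUrls(URL_WITH_FILE_EXTENSION, $unquoted)); // the match ends after ".doc"
```

Pattern B only helps when the set of extensions is known in advance. A file name that contains an extension-like fragment in the middle, such as "a.doc b.pdf", would be cut at the first one.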

Upvotes: 1

user2718671

Reputation: 2976

Alright, after working through this really helpful tutorial I finally understand how the regex syntax works. After finishing it I experimented a bit on this site.

It turned out to be pretty easy once I noticed that all hyperlinks in my parsed document sit between quotation marks, so I just had to change the regex to:

preg_match_all('#\bhttps?://[^()<>"]+#', $links, $match);

The match now runs up to the closing ", spaces included, and the search then continues with the next match that begins with http.

But that's not the full solution yet. The user Class was right: without applying rawurlencode to the filenames, the links won't work.

So the next step was this:

function endsWith($haystack, $needle)
{
    return $needle === "" || substr($haystack, -strlen($needle)) === $needle;
}

if (endsWith($textlink, ".doc") || endsWith($textlink, ".docx") || endsWith($textlink, ".pdf") || endsWith($textlink, ".jpg") || endsWith($textlink, ".jpeg") || endsWith($textlink, ".png")) {
    $file = substr($textlink, strrpos($textlink, '/') + 1);
    $rest_url = substr($textlink, 0, strrpos($textlink, '/') + 1);
    $textlink = $rest_url . rawurlencode($file);
}

That extracts the filenames from the URLs and applies rawurlencode to them so that the output links are correct.
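Both steps can be combined into one function. The following is a sketch under a few assumptions: the function name encodeDocumentLinks is made up, the links are assumed to sit between double quotes as described above, and pathinfo replaces the chain of endsWith calls for the extension check.

```php
<?php
// Hypothetical helper: find quoted http(s) links in $text and
// percent-encode the file name of those with a known extension.
function encodeDocumentLinks(string $text): array
{
    $extensions = ['doc', 'docx', 'pdf', 'jpg', 'jpeg', 'png'];

    // Same pattern as above: a link ends at the next ", (, ), < or >.
    preg_match_all('#\bhttps?://[^()<>"]+#', $text, $matches);

    $links = [];
    foreach ($matches[0] as $link) {
        $extension = strtolower(pathinfo($link, PATHINFO_EXTENSION));
        if (in_array($extension, $extensions, true)) {
            // Split after the last slash and encode only the file name.
            $split = strrpos($link, '/') + 1;
            $link = substr($link, 0, $split) . rawurlencode(substr($link, $split));
        }
        $links[] = $link;
    }

    return $links;
}

$html = '<a href="http://my-url.com/files/my doc.doc">doc</a> '
      . '<a href="http://my-url.com/page">page</a>';
print_r(encodeDocumentLinks($html));
```

Only the last path segment is encoded here, so a space in a directory name would still slip through.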

Upvotes: 0

Nambi

Reputation: 12042

Use this regex:

preg_match_all("/^(?si)(?>\s*)(((?>https?:\/\/(?>www\.)?)?(?=[\.-a-z0-9]{2,253}(?>$|\/|\?|\s))[a-z0-9][a-z0-9-]{1,62}(?>\.[a-z0-9][a-z0-9-]{1,62})+)(?>(?>\/|\?).*)?)?(?>\s*)$/", $input_lines, $output_array);

Upvotes: 0

Shankar Narayana Damodaran

Reputation: 68526

Why not just use PHP's filter functions?

<?php
$url = "http://my-url.com/my doc.doc";

if(!filter_var($url, FILTER_VALIDATE_URL))
{
    echo "URL is not valid";
}
else
{
    echo "URL is valid";
}

OUTPUT :

URL is not valid
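Since the validation fails on the raw space, one option is to encode the spaces first and validate again. This is a small sketch: the function name validateWithEncodedSpaces is hypothetical, and replacing only the space character is a simplifying assumption.

```php
<?php
// Hypothetical helper: return a valid URL, or false if the input cannot
// be made valid just by percent-encoding its spaces.
function validateWithEncodedSpaces(string $url)
{
    // Already valid: return it unchanged.
    if (filter_var($url, FILTER_VALIDATE_URL) !== false) {
        return $url;
    }

    // Encode the spaces, then let filter_var decide again.
    $encoded = str_replace(' ', '%20', $url);
    return filter_var($encoded, FILTER_VALIDATE_URL);
}

var_dump(validateWithEncodedSpaces('http://my-url.com/my doc.doc'));
var_dump(validateWithEncodedSpaces('my doc.doc'));
```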

Upvotes: 2
