ComputerLocus

Reputation: 3618

Preg_split matching more than what it should

Code:

    $pattern = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
    $urls = array();
    preg_match($pattern, $comment, $urls);

    return $urls;

According to an online regex tester, this regex is correct and should be working:

http://regexr.com?35nf9

I am outputting the $links array using:

$linkItems = $model->getLinksInComment($model->comments);
//die(print_r($linkItems));
echo '<ul>';
foreach($linkItems as $link) {
    echo '<li><a href="'.$link.'">'.$link.'</a></li>';
}
echo '</ul>';

The output looks like the following:

The $model->comments looks like the following:

destined for surplus
RT#83015
RT#83617
http://google.com
https://google.com
non-link

The generated list is only supposed to contain links, and there should be no empty items. Is there something wrong with what I did? The regex itself seems to be correct.

Upvotes: 0

Views: 125

Answers (2)

user428517

Reputation: 4193

If I'm understanding right, you should use preg_match_all in your getLinksInComment function instead, since preg_match stops after the first match:

preg_match_all($pattern, $comment, $matches);

if (isset($matches[0])) {
    return $matches[0];
}
return array();    #in case there are no matches

preg_match_all gets all matches in a string (even if the string contains newlines) and puts them into the array you supply as the third argument. However, anything matched by your regex's capture groups (e.g. (http|https|ftp|ftps)) will also be put into your $matches array (as $matches[1] and so on). That's why you want to return just $matches[0] as your final array of matches.

I just ran this exact code:

$line = "destined for surplus\n
RT#83015\n
RT#83617\n
http://google.com\n
https://google.com\n
non-link";

$pattern = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
preg_match_all($pattern, $line, $matches);

var_dump($matches);

and got this for my output:

array(3) {
  [0]=>
  array(2) {
    [0]=>
    string(17) "http://google.com"
    [1]=>
    string(18) "https://google.com"
  }
  [1]=>
  array(2) {
    [0]=>
    string(4) "http"
    [1]=>
    string(5) "https"
  }
  [2]=>
  array(2) {
    [0]=>
    string(0) ""
    [1]=>
    string(0) ""
  }
}
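To make the difference from the question's original call concrete, here is a minimal sketch that runs both functions on the same input. The pattern is copied from the question; the sample comment is a shortened stand-in for the real one, so treat it as an assumption.

```php
<?php
// Pattern copied from the question.
$pattern = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";

// Shortened sample comment (an assumption, modeled on the question's data).
$comment = "destined for surplus\nRT#83015\nhttp://google.com\nhttps://google.com\nnon-link";

// preg_match stops at the first match: $first[0] is that match,
// and $first[1] onward are its capture groups, not further URLs.
preg_match($pattern, $comment, $first);

// preg_match_all collects every match: $all[0] holds the full matches.
preg_match_all($pattern, $comment, $all);

var_dump($first[0]); // the first URL only
var_dump($first[1]); // the scheme captured by the first group
var_dump($all[0]);   // every URL in the comment
```

Looping over the array filled by preg_match therefore prints one URL followed by capture-group fragments, which would account for list items that are not links.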

Upvotes: 1

Aaron Miller

Reputation: 3780

Your comment is structured as multiple lines, some of which contain the URLs in which you're interested and nothing else. This being the case, you need not use anything remotely resembling that disaster of a regex to try to pick URLs out of the full comment text; you can instead split by newline, and examine each line individually to see whether it contains a URL. You might therefore implement a much more reliable getLinksInComment() thus:

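function getLinksInComment($comment) {
    $links = array();
    // Split the comment into lines, accepting both \n and \r\n line endings.
    foreach (preg_split('/\r?\n/', $comment) as $line) {
        // Skip any line that does not start with "http".
        if (!preg_match('/^http/', $line)) {
            continue;
        }
        $links[] = $line;
    }
    return $links;
}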

With suitable adjustment to serve as an object method instead of a bare function, this should solve your problem entirely and free you to go about your day.
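As a rough sketch of that adjustment, the function could sit inside a class like the one below. The class name CommentModel and its public $comments property are hypothetical, since the real model class is not shown in the question; the trim() call is also an addition, meant to tolerate stray whitespace around a line.

```php
<?php
// Hypothetical model class; the question does not show the real one.
class CommentModel
{
    public $comments = '';

    // Returns every line of $comment that starts with "http".
    public function getLinksInComment($comment)
    {
        $links = array();
        foreach (preg_split('/\r?\n/', $comment) as $line) {
            // Extra step not in the answer: drop surrounding whitespace.
            $line = trim($line);
            if (preg_match('/^http/', $line)) {
                $links[] = $line;
            }
        }
        return $links;
    }
}

// Usage mirroring the call in the question, with assumed sample data.
$model = new CommentModel();
$model->comments = "destined for surplus\nRT#83015\nhttp://google.com\nhttps://google.com\nnon-link";
print_r($model->getLinksInComment($model->comments));
```

Note that the /^http/ check only keeps http and https lines; the question's pattern also allowed ftp and ftps, so the check would need widening if those matter.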

Upvotes: 0
