chim

Reputation: 51

Filter an array of URLs that must contain specific text and must not contain other text

I want to extract specific links from a website.

The links look like this:

/topic/Funny/G1pdeJm

The links are always the same except for the random characters at the end.

I'm having a hard time combining these two conditions:

(preg_match("/^http:\/\//i",$str) || is_file($str))

and

(preg_match("/Funny(.*)/", $str) || is_file($str))

The first condition matches every link; the second matches only the links that contain the /topic/Funny/* part.

Unfortunately, I can't work out how to combine them. I also want to exclude these paths:

/topic/Funny/viral
/topic/Funny/time
/topic/Funny/top
/topic/Funny/top/week
/topic/Funny/top/month
/topic/Funny/top/year
/topic/Funny/top/all

Upvotes: 0

Views: 82

Answers (2)

mickmackusa

Reputation: 47894

I'll prepare a battery of test strings and show the implementation of using a regex to filter the URLs.

Regex Breakdown:

^
http://                              #match literal characters
[^/]+                                #match one or more non-slash characters (domain portion)
/topic/Funny/                        #match literal characters
(?!                                  #not followed by:
   viral                             #viral
   |time                             #OR time
   |top(?:/week|/month|/year|/all)?  #OR top, top/week, top/month, top/year, top/all
)

Implementation:

$tests = [
    'http://example.com/topic/Funny/G1pdeJm',
    'http://example.com/topic/Funny/viral',
    'http://example.com/topic/Funny/time',
    'http://example.com/topic/Funny/top',
    'http://example.com/topic/Funny/top/week',
    'http://example.com/topic/Funny/top/month',
    'http://example.com/topic/Funny/top/year',
    'http://example.com/topic/Funny/top/all',
    'http://example.com/topic/NotFunny/IL2dsRq',
];

$result = [];
foreach ($tests as $str) {
    if (preg_match('~^http://[^/]+/topic/Funny/(?!viral|time|top(?:/week|/month|/year|/all)?)~', $str)) {
        $result[] = $str;
    }
}
var_export($result);

Output:

array (
  0 => 'http://example.com/topic/Funny/G1pdeJm',
)
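
One thing to be aware of: the lookahead in the pattern above is not anchored, so a random ID that merely begins with "viral", "time", or "top" would be rejected as well. If that matters, a possible variation is to put `$` inside the lookahead so that only the exact blocked paths are excluded. This is a minimal sketch, not part of the original answer; the sample URLs (including the IDs `topXk29` and `timeQ7z`) are made up for illustration, and it uses `preg_grep()` in place of the loop.

```php
<?php
// Hypothetical sample URLs; example.com stands in for the real site.
$urls = [
    'http://example.com/topic/Funny/G1pdeJm',
    'http://example.com/topic/Funny/top',
    'http://example.com/topic/Funny/top/week',
    'http://example.com/topic/Funny/topXk29',   // made-up ID that starts with "top"
    'http://example.com/topic/Funny/timeQ7z',   // made-up ID that starts with "time"
    'http://example.com/topic/NotFunny/IL2dsRq',
];

// The $ inside the lookahead means a blocked word is rejected only
// when it makes up the whole remainder of the URL.
$pattern = '~^http://[^/]+/topic/Funny/(?!(?:viral|time|top(?:/(?:week|month|year|all))?)$)~';

// preg_grep() keeps the elements that match; array_values() reindexes from 0.
$filtered = array_values(preg_grep($pattern, $urls));

var_export($filtered);
```

With these sample URLs, the filter keeps the `G1pdeJm`, `topXk29`, and `timeQ7z` links and drops the rest.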

Upvotes: 0

Scott Weaver

Reputation: 7361

You could try using a negative lookahead to filter out the URLs you don't want:

.*\/Funny\/(?!viral|time|top\/week|top\/month|top\/year|top\/all|top(\n|$)).*

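As a rough illustration of how this pattern could be applied in PHP, the sketch below passes it to `preg_grep()` unchanged, wrapped in `/` delimiters. The URL list is an assumption: the paths come from the question, while the `example.com` host is a placeholder.

```php
<?php
// Paths taken from the question; example.com is a placeholder host.
$links = [
    'http://example.com/topic/Funny/G1pdeJm',
    'http://example.com/topic/Funny/viral',
    'http://example.com/topic/Funny/time',
    'http://example.com/topic/Funny/top',
    'http://example.com/topic/Funny/top/week',
    'http://example.com/topic/Funny/top/month',
    'http://example.com/topic/Funny/top/year',
    'http://example.com/topic/Funny/top/all',
    'http://example.com/topic/NotFunny/IL2dsRq',
];

// The pattern from this answer, verbatim, between / delimiters.
// Single quotes keep the backslashes intact for the regex engine.
$pattern = '/.*\/Funny\/(?!viral|time|top\/week|top\/month|top\/year|top\/all|top(\n|$)).*/';

// Keep only the links that match, reindexed from 0.
$kept = array_values(preg_grep($pattern, $links));

var_export($kept);
```

For this list, only the `G1pdeJm` link is kept.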

Upvotes: 2
