Reputation: 51
I want to extract specific links from a website.
The links look like this:
/topic/Funny/G1pdeJm
The links always have the same form; only the random characters at the end differ.
I'm having a hard time combining these two conditions:
(preg_match("/^http:\/\//i",$str) || is_file($str))
and
(preg_match("/Funny(.*)/", $str) || is_file($str))
The first condition matches every link; the second matches only the links that contain the /topic/Funny/* part.
Unfortunately, I can't work out how to combine them. I also want to exclude these links:
/topic/Funny/viral
/topic/Funny/time
/topic/Funny/top
/topic/Funny/top/week
/topic/Funny/top/month
/topic/Funny/top/year
/topic/Funny/top/all
Upvotes: 0
Views: 82
Reputation: 47894
I'll prepare a battery of test strings and show the implementation of using a regex to filter the URLs.
Regex Breakdown:
^
http://                              #match literal characters
[^/]+                                #match one or more non-slash characters (domain portion)
/topic/Funny/                        #match literal characters
(?!                                  #not followed by:
  viral                              #viral
  |time                              #OR time
  |top(?:/week|/month|/year|/all)?   #OR top, top/week, top/month, top/year, top/all
)
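The breakdown above can also live inside the code itself by using PCRE's x (extended) modifier, which ignores unescaped whitespace in the pattern and treats # as the start of a comment. This is only a minimal sketch; the variable name $pattern and the two sample URLs are chosen for illustration.

```php
<?php
// Same pattern as in the breakdown, written with the x modifier so the
// comments can sit inside the regex. The comments must not contain the
// delimiter character, which is why none of them use a tilde.
$pattern = '~
    ^
    http://          # literal scheme
    [^/]+            # one or more non-slash characters (domain portion)
    /topic/Funny/    # literal path prefix
    (?!              # not followed by:
        viral
        |time
        |top(?:/week|/month|/year|/all)?
    )
~x';

var_dump(preg_match($pattern, 'http://example.com/topic/Funny/G1pdeJm'));  // int(1)
var_dump(preg_match($pattern, 'http://example.com/topic/Funny/top/week')); // int(0)
```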
Implementation: (Demo)
$tests = [
    'http://example.com/topic/Funny/G1pdeJm',
    'http://example.com/topic/Funny/viral',
    'http://example.com/topic/Funny/time',
    'http://example.com/topic/Funny/top',
    'http://example.com/topic/Funny/top/week',
    'http://example.com/topic/Funny/top/month',
    'http://example.com/topic/Funny/top/year',
    'http://example.com/topic/Funny/top/all',
    'http://example.com/topic/NotFunny/IL2dsRq',
];
$result = [];
foreach ($tests as $str) {
    if (preg_match('~^http://[^/]+/topic/Funny/(?!viral|time|top(?:/week|/month|/year|/all)?)~', $str)) {
        $result[] = $str;
    }
}
var_export($result);
Output:
array (
  0 => 'http://example.com/topic/Funny/G1pdeJm',
)
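One caveat, offered as a suggestion rather than as part of the pattern above: because the lookahead is not anchored, it also rejects any ID that merely begins with viral, time, or top. If the random IDs can start with those letters, a stricter variant might put a $ inside the lookahead so that only the exact blocked paths are rejected. The sketch below assumes that requirement; the ID topX9a2 is made up for illustration, and preg_grep() stands in for the foreach loop.

```php
<?php
$tests = [
    'http://example.com/topic/Funny/G1pdeJm',
    'http://example.com/topic/Funny/topX9a2',   // hypothetical ID that starts with "top"
    'http://example.com/topic/Funny/top',
    'http://example.com/topic/Funny/top/week',
    'http://example.com/topic/Funny/viral',
];

// The $ inside the lookahead means a URL is rejected only when the blocked
// word (optionally followed by /week, /month, /year or /all) ends the string.
$pattern = '~^http://[^/]+/topic/Funny/(?!(?:viral|time|top(?:/(?:week|month|year|all))?)$)~';

// preg_grep() preserves the original keys, so reindex with array_values().
$result = array_values(preg_grep($pattern, $tests));
var_export($result);
```

With these inputs, the G1pdeJm and topX9a2 URLs are kept, while top, top/week, and viral are filtered out.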
Upvotes: 0
Reputation: 7361
You could try using a negative lookahead to filter out the URLs you don't want:
.*\/Funny\/(?!viral|time|top\/week|top\/month|top\/year|top\/all|top(\n|$)).*
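As a rough usage sketch, the pattern above can be wrapped in delimiters and passed to preg_grep() to filter a list of URLs. The ~ delimiter, the variable names, and the sample URLs are assumptions made for the example, not part of the original pattern.

```php
<?php
// Pattern from above, wrapped in ~ delimiters. In a single-quoted PHP string
// the backslashes are passed through to PCRE unchanged.
$pattern = '~.*\/Funny\/(?!viral|time|top\/week|top\/month|top\/year|top\/all|top(\n|$)).*~';

$urls = [
    'http://example.com/topic/Funny/G1pdeJm',
    'http://example.com/topic/Funny/top',
    'http://example.com/topic/Funny/top/month',
    'http://example.com/topic/NotFunny/IL2dsRq',
];

// Keep only the URLs that match, then reindex the array.
$kept = array_values(preg_grep($pattern, $urls));
var_export($kept);
```

With this input only the G1pdeJm URL is kept. Note that this pattern does not check the http:// prefix, so it would need to be combined with the first condition if that check is still required.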
Upvotes: 2