Reputation: 17013
I'm trying to get this to work with Perl's regex but can't seem to figure it out. I want to grab any URL that has ".website." in it, except ones like this (with "en" preceding ".website."):
$linkhtml = 'http://en.search.website.com/?q=beach&' ;
This is an example of a URL that I would want the regex to return, while the one above is rejected:
$linkhtml = ' http://exsample.website.com/?q=beach&' ;
Here is my attempt at it. Any advice on what I'm doing wrong is appreciated.
$re2='(?<!en)'; # Negative lookbehind: not preceded by "en"
$re4='(.*)'; # Any number of characters
$re6='(\.)'; # Literal period
$re7='(website)'; # The word "website"
$re8='(\.)'; # Literal period
$re9='(.*)'; # Any number of characters
$re=$re4.$re2.$re6.$re7.$re8.$re9;
if ($linkhtml =~ /$re/)
Upvotes: 1
Views: 233
Reputation: 17013
Here's the final solution, in case anyone new to regex (as I am) comes across this in the future with a similar problem. In my case I wrapped this in a "for loop" so it would go through an array, but that just depends on the need.
$re1='(.*)'; # Any number of characters
$re2='(en)'; # Word 1
$re3='(.*)'; # Any number of characters
$re=$re1.$re2.$re3;
if ($linkhtml =~ /$re/)
{
#do nothing, as we don't want a link with "en" in it
}
else {
### find urls with ".website."
$re1='(.*)'; # Any number of characters
$re2='(\.)'; # period
$re3='(website)'; # Word 1
$re4='(\.)'; # period
$re5='(.*)'; # Any number of characters
$re=$re1.$re2.$re3.$re4.$re5;
if ($linkhtml =~ /$re/) {
#match to see if it is a link that has ".website." in it
## do something with the data as it matches, such as:
print "$linkhtml\n";
}
}
Upvotes: 0
Reputation: 53966
Negative lookbehind assertions don't work well if the content you are trying to match after the assertion is so general that it would match the assertion itself. Consider:
perl -wle'print "en.website" =~ qr/(?<!en\.)web/' # doesn't match
perl -wle'print "en.website" =~ qr/(?<!en\.)[a-z]/' # does match, because [a-z] is matching the 'en'
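The same point can be seen with the URLs from the question. This is only a small sketch: the variable names are made up for illustration, and the first URL (with "en" directly before ".website.") is a hypothetical case that is not in the question. A lookbehind only inspects the characters immediately before the position being tested, so it never sees an "en" that sits further left in the host.

```perl
use strict;
use warnings;

# Same idea as the question's pattern, without the surrounding (.*) groups.
my $pattern = qr/(?<!en)\.website\./;

# Hypothetical URL: "en" sits directly before ".website.".
my $adjacent = 'http://en.website.com/?q=beach&';
# URL from the question: "search" sits between "en" and ".website.".
my $distant  = 'http://en.search.website.com/?q=beach&';

# The lookbehind sees "en" right before the dot, so this does not match.
my $adjacent_matches = $adjacent =~ $pattern ? 1 : 0;
# The lookbehind sees "ch" (the end of "search"), so this still matches.
my $distant_matches  = $distant  =~ $pattern ? 1 : 0;

print "adjacent: $adjacent_matches\n";   # prints "adjacent: 0"
print "distant:  $distant_matches\n";    # prints "distant:  1"
```

That is probably why the attempt in the question lets the "en.search" URL through.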
The best thing to do here is what David suggested: use two patterns to separate the good values from the bad:
my @matches = grep {
/$pattern1/ and not /$pattern2/
} @strings;
...where pattern1 matches all URLs, and pattern2 matches just the 'en' URLs.
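For a concrete picture, here is a minimal sketch of that grep with the two patterns filled in. The patterns themselves are assumptions, not something given above: $pattern1 is taken to mean "contains .website." and $pattern2 "host starts with en."; adjust both to whatever your real URLs look like.

```perl
use strict;
use warnings;

# Assumed patterns, for illustration only.
my $pattern1 = qr/\.website\./;          # any URL containing ".website."
my $pattern2 = qr{^\s*https?://en\.}i;   # URLs whose host starts with "en."

my @strings = (
    'http://en.search.website.com/?q=beach&',   # rejected: host starts with "en."
    ' http://exsample.website.com/?q=beach&',   # kept
    'http://example.org/',                      # rejected: no ".website."
);

# Keep only the strings that match the first pattern but not the second.
my @matches = grep {
    /$pattern1/ and not /$pattern2/
} @strings;

print "$_\n" for @matches;
```

With these inputs only the "exsample" URL survives the filter.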
Upvotes: 1
Reputation: 131600
I'd just do it in two steps: first use a generic regular expression to check for any URL (or rather, anything that looks like a URL). Then check each result that matches that against another regex that looks for "en" occurring in the host before "wordpress", and discard anything that matches.
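A rough sketch of those two steps, assuming the site name is "website" as in the question and that a simple regex is good enough to pull out the host. The helper name wanted_link and both host patterns are hypothetical; a proper URL parser would be more robust.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 for links we want to keep, 0 otherwise.
sub wanted_link {
    my ($link) = @_;

    # Step 1: anything that looks like a URL; capture just the host part.
    my ($host) = $link =~ m{^\s*https?://([^/:?#]+)}i
        or return 0;

    # Only hosts containing ".website." are of interest.
    return 0 unless $host =~ /\.website\./;

    # Step 2: discard hosts with an "en" label somewhere before "website".
    return 0 if $host =~ /(?:^|\.)en\..*website\./;

    return 1;
}

my $rejected = wanted_link('http://en.search.website.com/?q=beach&');
my $kept     = wanted_link(' http://exsample.website.com/?q=beach&');

print "rejected: $rejected\n";   # prints "rejected: 0"
print "kept:     $kept\n";       # prints "kept:     1"
```

Checking the host on its own avoids false hits from "en" appearing in the path or query string.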
Upvotes: 1