Reputation: 7400
I have to retrieve this URL from a dirty HTML page:
......... http://www.imdb.com/title/tt0092699/ ......
Obviously, the URL can also be any of:
http://www.imdb.co.uk/title/tt0092699/
http://www.imdb.es/title/tt0092699/
http://www.imdb.com/title/tt0092699
https://www.imdb.com/title/tt0092699/
https://www.imdb.com/title/tt0092699
(that is, a different domain ending, http or https, and with or without the final slash)
Upvotes: 2
Views: 125
Reputation: 17508
This would work nicely, and it would also match URLs starting with //, i.e. protocol-relative URLs that inherit the scheme of the current page. Note that it matches any URL, not only IMDb ones.
(https?:|//)[^\s"]+
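Since that pattern is generic, one possible way to use it is to collect every URL-like string first and then keep only the IMDb title links. The sketch below assumes a made-up $html sample and a hypothetical second filter pattern; neither comes from the original answer.

```php
<?php
// Hypothetical sample input; not taken from a real page.
$html = 'see <a href="//www.imdb.com/title/tt0092699/">x</a> and http://example.com/page';

// Generic URL pattern from above; '#' delimiters avoid escaping the slashes.
preg_match_all('#(https?:|//)[^\s"]+#', $html, $matches);

// Keep only the candidates that look like IMDb title links
// (this filter pattern is an assumption added for illustration).
$imdb = array_values(array_filter(
    $matches[0],
    function ($u) {
        return preg_match('#imdb\.[a-z.]+/title/tt\d+#', $u) === 1;
    }
));
```

Because the pattern only stops at whitespace or a double quote, a URL followed directly by other characters (for example a single quote or <) would be captured together with them, so some trimming may be needed on messier input.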
Upvotes: 0
Reputation: 53329
Use this regex:
preg_match("/https?:\/\/www\.imdb\..*?\/title\/tt\d+\/?/", $html, $matches);
The URL you want will be in $matches[0].
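As a minimal sketch of how this might be used on a whole page, the snippet below runs the pattern with preg_match_all over a made-up $html string containing several of the variants from the question. The sample input and the variable names $pattern and $urls are assumptions for illustration, and every literal dot in the pattern is escaped.

```php
<?php
// Hypothetical "dirty" HTML standing in for the real page.
$html = '<p>junk <a href="https://www.imdb.co.uk/title/tt0092699">a</a> junk</p>'
      . "\n" . '<div>more http://www.imdb.es/title/tt0092699/ noise</div>';

// The pattern from the answer, with every literal dot escaped.
$pattern = '/https?:\/\/www\.imdb\..*?\/title\/tt\d+\/?/';

// preg_match would return only the first hit; preg_match_all collects all of them.
preg_match_all($pattern, $html, $matches);

// $matches[0] holds the full matched URLs; drop duplicates and reindex.
$urls = array_values(array_unique($matches[0]));
```

Note that .*? accepts any characters except a newline, so on a single long line it could stretch from one IMDb link to a later /title/tt... segment. A stricter class such as [a-z.]+ for the domain ending may be safer; that is a suggested variation, not part of the answer.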
Here's what the regex means, broken down piece by piece:
/ => start of the regex
https? => literal http followed by an optional s
:\/\/www\.imdb\. => literal ://www.imdb. (each dot is escaped as \. so that it matches only a literal dot; an unescaped . matches any character)
.*?\/ => matches the shortest possible string before a slash, then the slash; this covers the end of the domain, whatever it is (com, co.uk, es, etc.), and the first slash following it
title\/ => literal title/
tt\d+ => literal tt followed by at least one digit (the match is greedy, so it takes as many consecutive digits as it can); this matches IDs in the format you provided
\/? => optional final /
/ => end of the regex
Upvotes: 4