sparkle

Reputation: 7400

How to retrieve a URL inside an HTML page?

I have to retrieve this URL from a messy HTML page:

......... http://www.imdb.com/title/tt0092699/ ......

Obviously, the URL can also be any of these:

http://www.imdb.co.uk/title/tt0092699/
http://www.imdb.es/title/tt0092699/
http://www.imdb.com/title/tt0092699
https://www.imdb.com/title/tt0092699/
https://www.imdb.com/title/tt0092699

(that is, with a different domain ending, with http or https, and with or without the final slash)

Upvotes: 2

Views: 125

Answers (2)

Razor

Reputation: 17508

This would work nicely, and it would also match URLs starting with //, which is the protocol-relative form (no explicit scheme).

(https?:|//)[^\s"]+
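Since that pattern matches any URL, not only IMDb ones, a second filtering step is probably needed. Here is a minimal sketch of how that could look in PHP; the sample `$html` string and the filter pattern are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical sample input; the real page content is not shown in the question.
$html = '<a href="//www.imdb.com/title/tt0092699/">x</a> '
      . '<a href="https://example.com/about">y</a>';

// The generic pattern from above, wrapped in ~ delimiters so that
// the slashes inside it need no escaping.
preg_match_all('~(https?:|//)[^\s"]+~', $html, $found);

// Keep only the candidates that look like IMDb title links
// (this filter pattern is an assumption, adjust as needed).
$imdb = array_values(array_filter($found[0], function ($url) {
    return preg_match('~//www\.imdb\.[a-z.]+/title/tt\d+~', $url) === 1;
}));

print_r($imdb);
```

With this sample input, `$found[0]` holds both URLs and `$imdb` keeps only the IMDb one.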

Upvotes: 0

Ben Lee

Reputation: 53329

Use this regex:

preg_match("/https?:\/\/www.imdb\..*?\/title\/tt\d+\/?/", $html, $matches);

If the pattern matches, the URL you want will be in $matches[0].
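For completeness, a small usage sketch that checks the return value before reading `$matches`. The `$html` string here is a made-up stand-in for the real page, and the variable names are only illustrative.

```php
<?php
// The pattern from the answer above, unchanged.
$pattern = "/https?:\/\/www.imdb\..*?\/title\/tt\d+\/?/";

// Hypothetical stand-in for the page contents.
$html = '<td>foo <a href="https://www.imdb.co.uk/title/tt0092699">link</a> bar</td>';

// preg_match returns 1 on a match, 0 on no match, and false on error,
// so $matches[0] should only be read when the result is 1.
$url = null;
if (preg_match($pattern, $html, $matches) === 1) {
    $url = $matches[0];
}

echo $url, PHP_EOL;
```

Note that preg_match stops at the first match; preg_match_all would be the call to use if the page may contain several such links.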

Here's the regex meaning, broken down piece by piece:

  • / => start regex
  • https? => literal http followed by optional s
  • :\/\/www.imdb\. => literal ://www.imdb. (strictly speaking, the unescaped . after www matches any single character; write www\. to match only a literal dot)
  • .*?\/ => matches the shortest string possible before a slash, then the slash itself; this covers the domain ending, whatever it is (com, co.uk, es, etc.), and the first slash following it
  • title\/ => literal title/
  • tt\d+ => literal tt followed by at least one digit (it's a greedy match, so it will match as many consecutive digits as it can); will match ids in the format you provided
  • \/? => optional final /
  • / => end regex
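The pieces above can also be tightened a little. The following is only a sketch of one possible variation, not the pattern from the answer: it assumes the domain ending consists of lowercase letters and dots, uses ~ as the delimiter so the slashes need no escaping, and adds a capture group (my addition) to pull out the title id. It is run against the URL variants from the question.

```php
<?php
// A tighter variation (an assumption, not the answer's original pattern):
// www\. matches a literal dot only, and [a-z.]+ limits the domain ending
// to lowercase letters and dots instead of "anything up to a slash".
$pattern = '~https?://www\.imdb\.[a-z.]+/title/(tt\d+)/?~';

// The URL variants listed in the question.
$urls = [
    'http://www.imdb.com/title/tt0092699/',
    'http://www.imdb.co.uk/title/tt0092699/',
    'http://www.imdb.es/title/tt0092699/',
    'http://www.imdb.com/title/tt0092699',
    'https://www.imdb.com/title/tt0092699/',
    'https://www.imdb.com/title/tt0092699',
];

// Collect the captured title id from every URL that matches.
$ids = [];
foreach ($urls as $url) {
    if (preg_match($pattern, $url, $m) === 1) {
        $ids[] = $m[1];
    }
}

print_r(array_unique($ids));
```

All six variants match and yield the same id, tt0092699.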

Upvotes: 4
