How to retrieve a URL inside an HTML page?

Question

I have to retrieve this URL from a dirty HTML page:

Obviously the URL can also be

(a different domain ending, http or https, or without the final slash)

Ben Lee · Accepted Answer

Use this regex:

preg_match("/https?:\/\/www\.imdb\..*?\/title\/tt\d+\/?/", $html, $matches);

The URL you want will be in $matches[0].

Here's what the regex means, broken down piece by piece:

/ => start regex
https? => literal http followed by an optional s
:\/\/www\.imdb\. => literal ://www.imdb.
.*?\/ => matches the shortest possible string before a slash, then the slash itself; this covers the domain ending, whatever it is (com, co.uk, es, etc.), and the first slash following it
title\/ => literal title/
tt\d+ => literal tt followed by at least one digit (the match is greedy, so it takes as many consecutive digits as it can); this matches IDs in the format you provided
\/? => optional final /
/ => end regex
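
To see the pieces working together, here is a minimal sketch that wraps the pattern in a helper and runs it against a few inputs. The function name extractImdbTitleUrl and every sample URL are made up for illustration; they are not taken from the question. The sketch assumes PHP 7.1 or later (nullable return type) and escapes the dot after www so that it matches only a literal dot.

```php
<?php
// Hypothetical helper name; returns the first matching URL, or null if none is found.
function extractImdbTitleUrl(string $html): ?string
{
    // The pattern from the answer, with the dot after "www" escaped.
    $pattern = '/https?:\/\/www\.imdb\..*?\/title\/tt\d+\/?/';

    // preg_match() returns 1 on a match, 0 on no match, and false on error.
    if (preg_match($pattern, $html, $matches) === 1) {
        return $matches[0];
    }
    return null;
}

// Made-up sample inputs covering the variations mentioned in the question.
$samples = [
    '<a href="http://www.imdb.com/title/tt0000001/">with final slash</a>',
    '<a href="https://www.imdb.com/title/tt0000001">https, no final slash</a>',
    '<a href="http://www.imdb.co.uk/title/tt0000001/">other domain ending</a>',
    '<a href="http://www.imdb.com/name/nm0000001/">not a title page</a>',
];

foreach ($samples as $html) {
    $url = extractImdbTitleUrl($html);
    echo ($url ?? '(no match)'), PHP_EOL;
}
```

Checking the return value of preg_match() before reading $matches[0] avoids an undefined-index notice when the page contains no such link. Note that .*? is lazy but not bounded, so on very messy input it may be worth restricting it to characters that cannot end a URL; that is a possible refinement, not something the answer above requires.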

Answers (2)