Regex to parse an IMDb page and get the movie names

Question

I'm not very good at regex and have looked everywhere I could. I could use some help parsing this page (http://www.imdb.com/search/title?count=100&groups=oscar_best_picture_winners&sort=year,desc&ref_=nv_ch_osc_3) to get the movie names. P.S.: I could use a dummy regex too.

Steven · Accepted Answer

Short Answer

This is almost the same problem as your previous question, and the answer is the same, albeit with a modified regex.

#(\d+)..*?(.*?)#s

https://stackoverflow.com/a/19600974/2573622

Expanded answer

About regex

For more information you might want to check out the following link:

http://www.regular-expressions.info/

Click Tutorial in the top menu bar; it explains just about everything regex.

Making the regex

First, you have to get the relevant HTML (for one movie) from the page...

RANK.

FILM TITLE

You then strip out the noise/changeable info...

RANK..*?FILM TITLE

Then add your capture groups...

(RANK)..*?(FILM TITLE)

and that's it:

 #(\d+)..*?(.*?)#s

The s modifier after the closing pattern delimiter makes . match newlines as well, so the pattern can span several lines of HTML.
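If you want to see what the s modifier and the lazy .*? actually do, here is a minimal sketch. The input strings are made up for the demonstration and are not taken from the IMDb page.

```php
<?php
// Hypothetical two-line input, used only to illustrate the s modifier.
$text = "1.\nArgo";

// Without s, "." stops at the newline, so the pattern cannot span lines.
$withoutS = preg_match('#(\d+)\..*?(Argo)#', $text);

// With s, "." also matches "\n", so the pattern can cross the line break.
$withS = preg_match('#(\d+)\..*?(Argo)#s', $text, $m);

var_dump($withoutS); // int(0)
var_dump($withS);    // int(1)

// Hypothetical input to compare greedy and lazy quantifiers.
$row = '<b>1.</b><b>2.</b>';

// Greedy .* runs to the LAST </b>; lazy .*? stops at the FIRST one.
preg_match('#<b>(.*)</b>#', $row, $greedy);
preg_match('#<b>(.*?)</b>#', $row, $lazy);

echo $greedy[1], "\n"; // 1.</b><b>2.
echo $lazy[1], "\n";   // 1.
```

The lazy form is what keeps each match confined to a single movie instead of swallowing everything up to the last title on the page.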

With code

Same as in the previous answer (with the modified regex):

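// Download the search results page.
$page = file_get_contents('http://www.imdb.com/search/title?count=100&groups=oscar_best_picture_winners&sort=year,desc&ref_=nv_ch_osc_3');

// Collect every rank (group 1) and film title (group 2).
preg_match_all('#(\d+)..*?(.*?)#s', $page, $matches);

// Build a rank => title lookup array.
$filmList = array_combine($matches[1], $matches[2]);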

Then you can do:

echo $filmList[1];

/**
Output:

Argo

*/

echo array_search("The Artist", $filmList);

/**
Output:

2

*/
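The live page's markup may differ or change over time, so it can help to try the same steps on a small hard-coded sample first. The sketch below assumes a made-up markup (the td classes, the /title/ links and the tt IDs are hypothetical stand-ins, not the real IMDb HTML), and the pattern is written for that sample only.

```php
<?php
// Hypothetical sample markup: NOT the real IMDb HTML, only a stand-in.
$page = <<<'HTML'
<td class="number">1.</td>
<td class="title">
    <a href="/title/tt0000001/">Argo</a>
</td>
<td class="number">2.</td>
<td class="title">
    <a href="/title/tt0000002/">The Artist</a>
</td>
HTML;

// Group 1 captures the rank, group 2 the link text (the film title).
// The tag and attribute names match the sample above, nothing more.
$pattern = '#<td class="number">(\d+)\.</td>.*?<a href="[^"]*">(.*?)</a>#s';

// preg_match_all returns the number of full matches (or false on error).
$count = preg_match_all($pattern, $page, $matches);

// Only combine when something matched; array_combine needs equal-sized,
// non-empty input to give a useful rank => title array.
$filmList = array();
if ($count) {
    $filmList = array_combine($matches[1], $matches[2]);
}

echo $filmList[1], "\n";                          // Argo
echo array_search('The Artist', $filmList), "\n"; // 2
```

Checking the match count before calling array_combine avoids building the list from an empty result when the pattern no longer fits the page.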

http://php.net/manual/en/reference.pcre.pattern.modifiers.php
http://php.net/file_get_contents
http://php.net/preg_match_all
http://php.net/array_combine
http://php.net/array_search