far2005

Reputation: 64

Regex to parse an imdb page and get the name

I'm not very good at regex and have looked everywhere I could. I could use some help parsing this page (http://www.imdb.com/search/title?count=100&groups=oscar_best_picture_winners&sort=year,desc&ref_=nv_ch_osc_3) to get the movie names. P.S.: I could use a dummy regex too.

Upvotes: 0

Views: 1779

Answers (2)

Steven

Reputation: 6148

Short Answer

This is almost the same problem as in your previous question, and the answer is the same, albeit with a modified regex.

#<td class="number">(\d+).</td>.*?<a href="/title/tt\d+/">(.*?)</a>#s

https://stackoverflow.com/a/19600974/2573622


Expanded answer

About regex

For more information you might want to check out the following link:

http://www.regular-expressions.info/

Click Tutorial in the top menu bar; it explains just about everything there is to know about regex.

Making the regex

First, get the relevant HTML for one movie from the page:

<td class="number">RANK.</td>
  <td class="image">
    <a href="/title/tt000000/" title="FILM TITLE (YEAR)"><img src="http://imdb.com/path-to-image.jpg" height="74" width="54" alt="FILM TITLE (YEAR)" title="FILM TITLE (YEAR)"></a>
  </td>
  <td class="title">
    

<span class="wlb_wrapper" data-tconst="tt000000" data-size="small" data-caller-name="search"></span>

    <a href="/title/tt000000/">FILM TITLE</a>

You then strip out the noise/changeable info...

<td class="number">RANK.</td>.*?<a href="/title/tt\d+/">FILM TITLE</a>

Then add your capture groups...

<td class="number">(RANK).</td>.*?<a href="/title/tt\d+/">(FILM TITLE)</a>

Finally, replace the placeholders with patterns (\d+ for the rank, .*? for the title), and that's it:

 #<td class="number">(\d+).</td>.*?<a href="/title/tt\d+/">(.*?)</a>#s

The s modifier after the closing pattern delimiter makes . match newlines as well, so .*? can span the line breaks between the rank cell and the title link.
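To see what the s modifier changes, here is a minimal sketch run against a hand-written fragment rather than the live page. The fragment, the placeholder title ID tt0000001 and the variable names are assumptions made for illustration, and the dot after the rank is escaped here as a slight tightening of the pattern above.

```php
<?php
// Hypothetical fragment shaped like the markup shown above.
$html = <<<HTML
<td class="number">1.</td>
  <td class="title">
    <a href="/title/tt0000001/">FILM ONE</a>
HTML;

// Same pattern as above, minus delimiters, with the literal dot escaped.
$pattern = '<td class="number">(\d+)\.</td>.*?<a href="/title/tt\d+/">(.*?)</a>';

// Without s, "." stops at newlines, so the rank cell and the link never join up.
$withoutS = preg_match('#' . $pattern . '#', $html, $m1);

// With s, ".*?" may cross the line breaks between the two elements.
$withS = preg_match('#' . $pattern . '#s', $html, $m2);

var_dump($withoutS);  // int(0)
var_dump($withS);     // int(1)
echo $m2[2], PHP_EOL; // FILM ONE
```

If the rank and the link ever sat on a single line, the modifier would make no difference; it only matters because the markup is spread over several lines.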

With code

Same as in previous answer (with modified regex)

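// Download the search results page as a single string.
$page = file_get_contents('http://www.imdb.com/search/title?count=100&groups=oscar_best_picture_winners&sort=year,desc&ref_=nv_ch_osc_3');

// $matches[1] collects the ranks, $matches[2] the film titles.
preg_match_all('#<td class="number">(\d+).</td>.*?<a href="/title/tt\d+/">(.*?)</a>#s', $page, $matches);

// Build a rank => title lookup.
$filmList = array_combine($matches[1], $matches[2]);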

Then you can do:

echo $filmList[1];

/**
Output:

Argo

*/

echo array_search("The Artist", $filmList);

/**
Output:

2

*/
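Since the live page may change or be unreachable, the same steps can be tried offline. The following is a sketch under assumptions: the two-row sample, its placeholder title IDs and the early exit when nothing matches are illustrative additions, not part of the original code.

```php
<?php
// Hypothetical sample: two rows in the layout shown earlier.
$page = <<<HTML
<td class="number">1.</td>
  <td class="image">
    <a href="/title/tt0000001/" title="Argo (2012)"><img src="http://imdb.com/path-to-image.jpg" height="74" width="54" alt="Argo (2012)" title="Argo (2012)"></a>
  </td>
  <td class="title">
    <a href="/title/tt0000001/">Argo</a>
<td class="number">2.</td>
  <td class="title">
    <a href="/title/tt0000002/">The Artist</a>
HTML;

$pattern = '#<td class="number">(\d+).</td>.*?<a href="/title/tt\d+/">(.*?)</a>#s';

// preg_match_all returns the number of full matches, or false on error.
$count = preg_match_all($pattern, $page, $matches);
if ($count === false || $count === 0) {
    exit("No films matched; the page layout may have changed.\n");
}

// The image link is skipped because its href is followed by a title attribute,
// not by the closing ">" that the pattern requires.
$filmList = array_combine($matches[1], $matches[2]);

echo $filmList[1], PHP_EOL;                          // Argo
echo array_search('The Artist', $filmList), PHP_EOL; // 2
```

Checking the match count first avoids building an empty list silently when the markup no longer fits the pattern.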

http://php.net/manual/en/reference.pcre.pattern.modifiers.php
http://php.net/file_get_contents
http://php.net/preg_match_all
http://php.net/array_combine
http://php.net/array_search

Upvotes: 3

user645280

Not sure which backslashes you do/don't need:

href=\"\/title\/tt.*height=\"74\" width=\"54\" alt=\"([^"]*)\"

The useful result is in capture group 1 (\1 or $1).
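As a rough sketch of how this pattern might look in PHP: with # as the delimiter and a single-quoted string, none of the backslashes should be needed. The sample rows and placeholder IDs are assumptions. Note that the alt text includes the year, so the last step strips it with a second, illustrative regex.

```php
<?php
// Hypothetical sample: one image link per line, as in the markup quoted above.
$page = <<<HTML
<a href="/title/tt0000001/" title="Argo (2012)"><img src="http://imdb.com/path-to-image.jpg" height="74" width="54" alt="Argo (2012)" title="Argo (2012)"></a>
<a href="/title/tt0000002/" title="The Artist (2011)"><img src="http://imdb.com/path-to-image.jpg" height="74" width="54" alt="The Artist (2011)" title="The Artist (2011)"></a>
HTML;

// No s modifier: the greedy ".*" stays on one line, so each row matches separately.
$pattern = '#href="/title/tt.*height="74" width="54" alt="([^"]*)"#';
preg_match_all($pattern, $page, $matches);

// Drop a trailing " (YEAR)" from each captured alt text.
$titles = preg_replace('/\s*\(\d{4}\)$/', '', $matches[1]);

print_r($matches[1]); // alt texts, year included
print_r($titles);     // titles only
```

This variant depends on the fixed image dimensions in the markup, so it would stop matching if the thumbnails were resized.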

Upvotes: 0
