Ignacio

Reputation: 8035

Scrape web page and retrieve javascript variables

I need to scrape a web page that has a JavaScript array embedded in inline JavaScript code, such as:

<script>
    var videos = new Array();
    videos[0] = 'http://myvideos.com/video1.mov'; 
    videos[1] = ....
    ....
</script>

What's the easiest way to approach this and end up with a PHP array of these video URLs?

Edit: All videos have the .mov extension.

Upvotes: 2

Views: 1370

Answers (2)

Eugen Rieck

Reputation: 65342

This is a bit more complicated, but it will pick up only those links that really have the form videos[0] = 'http://myvideos.com/video1.mov';

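// $original is assumed to hold the page HTML, e.g. from file_get_contents($url).
// Strip line breaks so the pattern can span the whole script block.
$tmp=str_replace(array("\r","\n"),'',$original);
// Group 1 captures the run of videos[n] = 'http://...'; assignments.
$pattern='/\<script\>\s+var\ videos.*?((\s*videos\[\d+\]\ \=\ .http\:\/\/.*?\;\s*?)+)(.*?)\<\/script\>/';
$a=preg_match_all($pattern,$tmp,$matches);
unset($tmp);

if (!$a) die("no matches");

// Split the captured run on each "videos[n] = " prefix.
$pattern="/videos\[\d+\]\ \=\ /";
$matches=preg_split($pattern,$matches[1][0]);

$final=array();
while(sizeof($matches)>0) {
  $match=trim(array_shift($matches));
  if ($match=='') continue;
  // Drop the opening quote and the trailing quote plus semicolon.
  $final[]=substr($match,1,-2);
}
unset($matches);

print_r($final);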

After feedback from the OP, here is the simplified version:

$original=file_get_contents($url);
$pattern='/http\:\/\/.*?\.mov/';
$a=preg_match_all($pattern,$original,$matches);
if (!$a) die("no matches");
print_r($matches[0]);
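
A possible middle ground between the two versions, as a rough sketch only: match just the videos[n] = '...'; assignments and capture the quoted value, so other URLs on the page are ignored without having to match the whole script block. The function name extract_video_urls is hypothetical, and the second URL in the sample input is made up, because the question elides it.

```php
<?php
// Hypothetical helper: collect the strings assigned to videos[n] in inline JavaScript.
function extract_video_urls(string $html): array
{
    // Group 1 is the array index, group 2 the opening quote, group 3 the quoted value.
    $pattern = '/videos\[(\d+)\]\s*=\s*([\'"])(.*?)\2\s*;/';
    $urls = array();
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            // Key by the JavaScript index so the original order survives.
            $urls[(int) $m[1]] = $m[3];
        }
    }
    ksort($urls);
    return $urls;
}

// Sample input modelled on the question; the second URL is an assumption.
$html = "<script>\n"
      . "    var videos = new Array();\n"
      . "    videos[0] = 'http://myvideos.com/video1.mov';\n"
      . "    videos[1] = 'http://myvideos.com/video2.mov';\n"
      . "</script>";

print_r(extract_video_urls($html));
```

For a real page you would pass in the result of file_get_contents($url) instead of the sample string. This assumes the URLs contain no escaped quotes.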

Upvotes: 1

Adrien Schuler

Reputation: 2425

You can scrape this by reading the page with file_get_contents and then retrieving the URLs with a regex. This is the simplest way I know, especially if you know the file extensions of your videos. Example:

<?php
$file = file_get_contents('http://google.com');
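// (?:fr|com) matches either TLD; the original [fr|com]+ was a character class, not an alternation.
$pattern = '/http:\/\/([a-zA-Z0-9\-\.]+\.(?:fr|com))/i';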
preg_match_all($pattern, $file, $matches);
var_dump($matches);

Upvotes: 1
