El Classico

Reputation: 29

How can I extract the links from a page of HTML?

I am trying to download a file in PHP:

$file = file_get_contents($url);

How can I download the contents of the links found within the file at $url?

Upvotes: 0

Views: 1471

Answers (3)

Dennis G

Reputation: 21778

So you want to find all URLs in a given file? RegEx to the rescue! Here is some sample code that should do what you want:

$file = file_get_contents($url);
if ($file === false) return;

//extract the hyperlinks from the file via regex
preg_match_all("/http:\/\/[A-Z0-9_\-\.\/\?\#\=\&]*/i", $file, $urlmatches);

//if there are any URLs to be found
if (count($urlmatches[0]) > 0) {
    $urlmatches = $urlmatches[0];
    //count number of URLs
    $numberofmatches = count($urlmatches);
    echo "Found $numberofmatches URLs in $url\n";

    //write all found URLs line by line
    foreach($urlmatches as $urlmatch) {
        echo "URL: $urlmatch...\n";
    }
}

EDIT: If I understand your question correctly, you now want to download the contents of the found URLs. You would do that inside the foreach loop, calling file_get_contents for each URL, but you probably want to do some filtering beforehand (e.g. skip images).
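A minimal sketch of that idea, reusing the $urlmatches array from the snippet above (the image-extension filter and the md5-based filenames are just illustrative choices):

//download the contents of each found URL, skipping obvious image links
foreach ($urlmatches as $urlmatch) {
    //crude filter: skip common image extensions
    if (preg_match("/\.(jpe?g|png|gif)$/i", $urlmatch)) {
        continue;
    }

    $contents = file_get_contents($urlmatch);
    if ($contents === false) {
        echo "Failed to download $urlmatch\n";
        continue;
    }

    //save each page under a filename derived from its URL
    $filename = md5($urlmatch) . ".html";
    file_put_contents($filename, $contents);
    echo "Saved $urlmatch to $filename\n";
}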

Upvotes: 1

Nathan

Reputation: 11149

This requires parsing HTML, which is quite a challenge in PHP. To save you a lot of trouble, download an HTML parsing library, such as PHPQuery (http://code.google.com/p/phpquery/). Then you'll have to select all the links with pq('a'), loop through them getting their href attribute values, and for each one, convert it from relative to absolute and run a file_get_contents on the resulting URL. Hopefully these pointers should get you started.
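A rough sketch of that approach, assuming phpQuery has been downloaded and its main file can be included as phpQuery.php (the include path, the example $baseUrl, and the relative-to-absolute conversion are simplified placeholders):

require_once 'phpQuery.php'; //adjust the path to wherever you unpacked phpQuery

$baseUrl = 'http://example.com/'; //the page you are scraping
$html = file_get_contents($baseUrl);
phpQuery::newDocumentHTML($html);

//select all <a> elements and read their href attributes
foreach (pq('a') as $anchor) {
    $href = pq($anchor)->attr('href');
    if (!$href) continue;

    //very simplified relative-to-absolute conversion
    if (!preg_match('#^https?://#i', $href)) {
        $href = rtrim($baseUrl, '/') . '/' . ltrim($href, '/');
    }

    $contents = file_get_contents($href);
    if ($contents !== false) {
        echo "Downloaded " . strlen($contents) . " bytes from $href\n";
    }
}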

Upvotes: 2

Dutchie432

Reputation: 29160

You'll need to parse the resulting HTML string, either manually or via a third-party library.

HTML Scraping in Php

Upvotes: 0
