Basic web-crawling question: How to create a list of all pages on a website using php?

Question

I would like to create a crawler using php that would give me a list of all the pages on a specific domain (starting from the homepage: www.example.com).

How can I do this in php?

I don't know how to recursively find all the pages on a website starting from a specific page and excluding external links.

Paul Dixon · Accepted Answer

For the general approach, check out the answers to these questions:

In PHP, you should be able to simply fetch a remote URL with file_get_contents(). You could perform a naive parse of the HTML by using a regular expression with preg_match() to find tags and parse the URL out of them (See this question for some typical approaches).

Once you've extract the raw href attribute, you could use parse_url() to break into it components and figure out if its a URL you want to fetch - remember also the URLs may be relative to the page you've fetched.

Though fast, a regex isn't the best way of parsing HTML though - you could also try the DOM classes to parse the HTML you fetch, for example:

$dom = new DOMDocument();
$dom->loadHTML($content);

$anchors = $dom->getElementsByTagName('a');

if ( count($anchors->length) > 0 ) {
    foreach ( $anchors as $anchor ) {
        if ( $anchor->hasAttribute('href') ) {
            $url = $anchor->getAttribute('href');

            //now figure out whether to processs this
            //URL and add it to a list of URLs to be fetched
        }
    }
}

Finally, rather than write it yourself, see also this question for other resources you could use.

is there a good web crawler library available for PHP or Ruby?

Basic web-crawling question: How to create a list of all pages on a website using php?

Answers (2)

Related Questions