PHP Magento Screen Scraping

Question

I am trying to scrape a supplier's Magento site to save some time, because there are around 2000 products I need to gather info for. I'm totally OK with writing a screen scraper for pretty much anything, but I've encountered a major problem. I'm using file_get_contents to gather the HTML of the product page.

The problem is:

You need to be logged in to view the product page. It's a standard Magento login, so how can I get around this in my screen scraper? I don't require a full script, just advice on a method.

Christian Joudrey · Accepted Answer

Using stream_context_create you can specify headers to be sent when calling your file_get_contents.

What I'd suggest is: open your browser and log in to the site. Open up Firebug (or your favorite cookie viewer), grab the cookies, and send them with your request.

Edit: Here's an example from PHP.net:

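<?php
// Create a stream context with custom HTTP headers
$opts = array(
  'http'=>array(
    'method'=>"GET",
    // Each header line must end with \r\n
    'header'=>"Accept-language: en\r\n" .
              "Cookie: foo=bar\r\n"
  )
);

$context = stream_context_create($opts);

// Open the file using the HTTP headers set above
$file = file_get_contents('http://www.example.com/', false, $context);
?>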
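
With around 2000 products, the same context can be reused for every request. Here is a minimal sketch of that idea. The cookie name `frontend`, its value, the product URL, and the helper names `buildCookieContext` and `fetchPages` are all placeholders I made up for illustration; use whatever cookies your browser actually shows after logging in.

```php
<?php
// Hypothetical cookie values: replace with the ones copied from your logged-in browser.
$cookies = array(
    'frontend' => 'abc123', // assumed session cookie name, check your own browser
);

/**
 * Build a stream context that sends the given cookies with each request.
 */
function buildCookieContext(array $cookies)
{
    $pairs = array();
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }

    return stream_context_create(array(
        'http' => array(
            'method' => 'GET',
            // Multiple cookies go in one header, separated by "; "
            'header' => 'Cookie: ' . implode('; ', $pairs) . "\r\n",
        ),
    ));
}

/**
 * Fetch each URL with the same context. A failed request is stored as false.
 */
function fetchPages(array $urls, $context)
{
    $pages = array();
    foreach ($urls as $url) {
        $pages[$url] = @file_get_contents($url, false, $context);
    }
    return $pages;
}

$context = buildCookieContext($cookies);
// Placeholder URL, uncomment and substitute real product pages:
// $pages = fetchPages(array('http://www.example.com/some-product.html'), $context);
```

Keep in mind the session cookie will eventually expire, so if the pages start coming back as the login form you will need to grab a fresh one.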

Edit (2): This is outside the scope of your question, but if you are wondering how to scrape the website afterwards, you could look into the DOMDocument::loadHTML method. This gives you the functions you need to pull out the data (e.g. getElementsByTagName, getElementById, and XPath queries via DOMXPath).
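
As a rough sketch of the DOMDocument approach: the HTML below is a stand-in I made up, and the class names `product-name` and `price` as well as the helper `extractProduct` are assumptions, so inspect the real product page and adjust the XPath expressions to match.

```php
<?php
// Stand-in HTML; in practice this would be the string returned by file_get_contents.
$html = '<html><body>'
      . '<div class="product-name"><h1>Sample Widget</h1></div>'
      . '<span class="price">$19.99</span>'
      . '</body></html>';

/**
 * Extract the product name and price from a page (hypothetical markup).
 */
function extractProduct($html)
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so collect parser warnings instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $name  = $xpath->query('//div[@class="product-name"]/h1')->item(0);
    $price = $xpath->query('//span[@class="price"]')->item(0);

    // item(0) is null when nothing matched, so guard before reading the text.
    return array(
        'name'  => $name  ? trim($name->textContent)  : null,
        'price' => $price ? trim($price->textContent) : null,
    );
}

$product = extractProduct($html);
```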

If you want to scrape something simple, you can also use RegEx with preg_match_all.
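
For example, something along these lines would collect every price on a page. The markup and the `price` class are made up for illustration, and a pattern like this will break as soon as the element gains extra attributes or nested tags, which is why it only suits simple cases.

```php
<?php
// Stand-in HTML with two hypothetical price elements.
$html = '<span class="price">$19.99</span><span class="price">$5.00</span>';

// Capture the text inside every <span class="price"> element.
preg_match_all('~<span class="price">([^<]+)</span>~', $html, $matches);

// $matches[0] holds the full matches, $matches[1] holds the first capture group.
$prices = $matches[1];
```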
