Reputation: 4832
I am trying to scrape a suppliers magento site in an effort to save some time because of there being around 2000 products I need to gather info for. I'm totally OK with writing a screen scraper for pretty much anything but i've encountered a major problem. Im using get_file_contentsto gather the html of the product page.
The problem is:
You need to be logged in, to view the product page. Its a standard magento login, so how can I get round this in my screen scraper? I don't require a full script, just advice on a method.
Upvotes: 2
Views: 2476
Reputation: 130
If you're familiar with CURL this should be relatively simple to do in a day or so. I've created some similar apps to login to banks to retrieve data - which of course also require authentication.
Below is a link with an example of how to use CURL with cookies for authentication purposes:
http://coderscult.com/php/php-curl/2008/05/20/php-curl-cookies-example/
If you can grab the output of the page you can parse for your results with a regex. Alternatively, you can use a class like Snoopy to do this work for you:
http://sourceforge.net/projects/snoopy/
Upvotes: 0
Reputation: 3461
Using stream_context_create you can specify headers to be sent when calling your file_get_contents
.
What I'd suggest is, open your browser and login to the site. Open up Firebug (or your favorite Cookie viewer) and grab the cookies and send them with your request.
Edit: Here's an example from PHP.net:
<?php
// Create a stream
$opts = array(
'http'=>array(
'method'=>"GET",
'header'=>"Accept-language: en\r\n" .
"Cookie: foo=bar\r\n"
)
);
$context = stream_context_create($opts);
// Open the file using the HTTP headers set above
$file = file_get_contents('http://www.example.com/', false, $context);
?>
Edit (2): This is out of the scope of your question, but if you are wondering how to scrape the website afterwards you could look into the DOMDocument::loadHTML method. This will essentially give you the required functions (i.e. XPath query, getElementsByTagName, getElementsById) to scrape what you need.
If you want to scrape something simple, you can also use RegEx with preg_match_all.
Upvotes: 2