Reputation: 1456
I want to start off by saying that we only scrape our own account, because my company needs data from our own dashboard that we can't get from the MWS APIs. I am very familiar with those APIs.
I've had login/scraping scripts for years. But recently Amazon started offering up captchas. My old way of scraping was from PHP making cURL requests to mimic the browser.
My new approach uses PhantomJS and CasperJS to achieve the same effect. Everything worked fine for a day, but now I'm getting the captcha again.
Now, I happen to know from internal sources that Amazon isn't doing any scrape detection. They do, however, detect hacking and DDoS attacks. So I think something about this CasperJS code is getting flagged as an attack.
I don't think I'm calling the script too often, and I've changed the IP address that the requests come from.
Here is the CasperJS code:
var fs = require('fs');
var casper = require('casper').create({
    pageSettings: {
        loadImages: false,
        loadPlugins: false,
        userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36'
    }
});

// Reuse cookies from a previous run, if any were saved.
// fs.read throws when the file is missing, so check for it first.
var cookieFilename = "cookies/_cookies.txt";
if (fs.exists(cookieFilename)) {
    var data = fs.read(cookieFilename);
    if (data) {
        phantom.cookies = JSON.parse(data);
    }
}

// First step is to open Amazon
casper.start("https://sellercentral.amazon.com/gp/homepage.html", function() {
    console.log("Amazon website opened");
});

casper.wait(1000, function() {
    if (this.exists("form[name=signinWidget]")) {
        console.log("need to login");
        // Populate username and password, then submit the form
        casper.wait(1000, function() {
            console.log("Login using username and password");
            this.evaluate(function() {
                document.getElementById("username").value = "*****";
                document.getElementById("password").value = "*****";
                document.querySelector("form[name=signinWidget]").submit();
            });
        });
        // Write the cookies ('w' is the write mode; 644 is not a valid mode here)
        casper.wait(1000, function() {
            var cookies = JSON.stringify(phantom.cookies);
            fs.write(cookieFilename, cookies, 'w');
        });
    } else {
        console.log("already logged in");
    }
});

// Wait to be redirected to the home page, then dump the page content
casper.wait(1000, function() {
    console.log("is login found?");
    console.log(this.exists("form[name=signinWidget]"));
    this.echo(this.getPageContent());
});

casper.run();
The page content echoed in the last step is just a login page with a captcha. What gives? This should look like a normal browser. When I use the same login on my computer, I get no issues at all.
I've also tried several different user agent strings. Sometimes changing those works temporarily.
Also, when I run all this locally, it works fine. But on the Linux server it gets the captcha. Note that I've changed the IP on the remote Linux server many times. It still gets the captcha.
Upvotes: 1
Views: 2303
Reputation: 16838
As often happens with scraping and automation, the cause of an error is not necessarily an incorrectly written script; it can also be the context, such as the underlying infrastructure.
In this case we determined (in the comments) that the script was challenged with a captcha only when run from a particular server, whose IP address seems to have been put on an untrusted list.
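To confirm this kind of diagnosis, it can help to run the identical script from each machine and record which page came back. Below is a minimal sketch of a helper for that. The name `classifyPage` is hypothetical, and the captcha check rests on an assumption: that a challenge page contains the word "captcha" somewhere in its markup. The `signinWidget` form name is taken from the question's script.

```javascript
// Hypothetical helper: classify the HTML a run received so that the outcome
// from each environment (local machine, server) can be logged and compared.
function classifyPage(html) {
    var text = String(html || '').toLowerCase();

    // Assumption: a captcha challenge page mentions "captcha" in its markup.
    // Checked first, because the challenge appears on the login page itself.
    if (text.indexOf('captcha') !== -1) {
        return 'captcha';
    }

    // Form name taken from the question's script.
    if (text.indexOf('signinwidget') !== -1) {
        return 'login';
    }

    // Anything else: most likely the page that was actually requested.
    return 'other';
}
```

Inside the last CasperJS step, `this.echo(classifyPage(this.getPageContent()));` would print a single word per run. If the same script and cookies give `captcha` on the server and `other` locally, the difference lies in the environment and not in the code.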
Upvotes: 0