Tamil

Reputation: 5358

Selenium Webdriver JS Scraping in Parallel [nodejs]

I'm trying to create a pool of Phantom Webdrivers [using webdriverjs] like

var driver = new Webdriver.Builder().withCapabilities(Webdriver.Capabilities.phantomjs()).build();

Once the pool is populated [I see n phantom processes spawned], I call driver.get on different URLs [using different drivers from the pool], expecting the loads to run in parallel [as driver.get is async].

But I always see them run sequentially. Can't different URLs be loaded in parallel via different web driver instances? If that isn't possible this way, how else could I solve this?

A very basic implementation of my question would look like this:

var Webdriver = require('selenium-webdriver');

function getInstance() {
   return new Webdriver.Builder().withCapabilities(Webdriver.Capabilities.phantomjs()).build();
}

var pool = [];
for (var i = 0; i < 3; i++) {
  pool.push(getInstance());
}
pool[0].get("http://mashable.com/2014/01/14/outdated-web-features/").then(function () { 
  console.log(0);
});

pool[1].get("http://google.com").then(function () { 
  console.log(1);
});

pool[2].get("http://techcrunch.com").then(function () { 
  console.log(2);
});

PS: Have already posted it here

Update: I tried Selenium Grid with the following setup, since it was mentioned that it can run tests in parallel:

Hub:

java -jar selenium/selenium-server-standalone-2.39.0.jar -host 127.0.0.1 -port 4444 -role hub -nodeTimeout 600

Phantom:

phantomjs --webdriver=7777 --webdriver-selenium-grid-hub=http://127.0.0.1:4444 --debug=true
phantomjs --webdriver=7877 --webdriver-selenium-grid-hub=http://127.0.0.1:4444 --debug=true
phantomjs --webdriver=6777 --webdriver-selenium-grid-hub=http://127.0.0.1:4444 --debug=true

I still see the get commands being queued and executed sequentially instead of in parallel. [They do get distributed properly across the 3 instances.]

Am I still missing something?

Why does the doc mention "scale by distributing tests on several machines ( parallel execution )"?

What does parallel mean as far as the hub is concerned? I'm at a loss.

Upvotes: 1

Views: 1966

Answers (3)

Gibbonson

Reputation: 93

A little late, but for me it worked with webdriver.promise.createFlow. You just have to wrap your code in webdriver.promise.createFlow(function() { ... }); and it runs in parallel. Here's an example from the question "Make parallel requests to a Selenium Webdriver grid". All thanks to the answerer there...

var webdriver = require('selenium-webdriver');

// Each createFlow() call runs its callback in its own control flow,
// so the four sessions are not queued behind one another.
var flows = [0, 1, 2, 3].map(function(index) {
  return webdriver.promise.createFlow(function() {
    var driver = new webdriver.Builder().forBrowser('firefox').usingServer('http://someurl:44111/wd/hub/').build();

    console.log('Get');
    driver.get('http://www.somepage.com').then(function() {

      console.log('Screenshot');
      driver.takeScreenshot().then(function(data) {

        console.log('foo/test' + index + '.png');
        //var decodedImage = new Buffer(data, 'base64')

        driver.quit();
      });
    });
  });
});

Upvotes: 1

Gabriel

Reputation: 327

I had the same issue and finally got around it using child_process.

My app is set up with many tasks that do different things and need to run simultaneously (each uses a different driver instance); obviously that was not working. I now start each task in a child_process (which runs in a new V8 process), and everything does run in parallel.
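A minimal sketch of that idea, assuming one child per URL. The worker file path, the `buildTask` message shape and the `runInChildren` helper are all hypothetical names for illustration, not something from my actual app or from selenium-webdriver:

```javascript
// Sketch only: run one scraping task per child process.
// The worker script and the message shape are hypothetical.
var childProcess = require('child_process');

// Build the message sent to a worker for a given URL (hypothetical shape).
function buildTask(url, index) {
  return { id: index, url: url };
}

// Fork one child per URL. Each child is a separate Node process with its
// own event loop, so a slow get in one child does not hold up the others.
function runInChildren(workerPath, urls, onDone) {
  var remaining = urls.length;
  var results = [];
  if (remaining === 0) {
    onDone(results);
    return;
  }
  urls.forEach(function (url, index) {
    var child = childProcess.fork(workerPath);
    // Keep whatever the worker reports back, in the original URL order.
    child.on('message', function (result) {
      results[index] = result;
    });
    // Call onDone once every child has exited.
    child.on('exit', function () {
      remaining -= 1;
      if (remaining === 0) {
        onDone(results);
      }
    });
    child.send(buildTask(url, index));
  });
}
```

The worker script (say `worker.js`, also hypothetical) would listen with process.on('message', ...), build its own driver, call driver.get(task.url), report back with process.send(...) and then exit. Usage would be something like runInChildren('./worker.js', ['http://google.com', 'http://techcrunch.com'], console.log).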

Upvotes: 0

Tamil

Reputation: 5358

I guess I found the issue.

Basically, https://code.google.com/p/selenium/source/browse/javascript/node/selenium-webdriver/executors.js#39 is a synchronous, blocking operation [at least for get]. Whenever the get command is issued, node's main thread gets stuck there and no further code executes.
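This is only a guess on my part. One way to check it might be to run a timer next to the get calls and see whether its ticks stall. Below is a rough sketch of that check; `maxGap` and `watchEventLoop` are hypothetical helpers, not part of selenium-webdriver:

```javascript
// Sketch only: detect whether node's main thread is being blocked.
// Both helpers are hypothetical, not part of selenium-webdriver.

// Largest gap, in ms, between consecutive timestamps.
function maxGap(timestamps) {
  var worst = 0;
  for (var i = 1; i < timestamps.length; i++) {
    worst = Math.max(worst, timestamps[i] - timestamps[i - 1]);
  }
  return worst;
}

// Record a timestamp every intervalMs. Returns a stop() function that
// ends the recording and returns the worst gap seen between ticks.
function watchEventLoop(intervalMs) {
  var ticks = [Date.now()];
  var timer = setInterval(function () {
    ticks.push(Date.now());
  }, intervalMs);
  return function stop() {
    clearInterval(timer);
    ticks.push(Date.now());
    return maxGap(ticks);
  };
}
```

Call var stop = watchEventLoop(50) before issuing the get commands and stop() in the last then callback. A result close to 50 would suggest the main thread stayed free and the queuing happens somewhere else; a result of several seconds would support the blocking theory.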

Upvotes: 1
