Alex Heebs

Reputation: 830

JavaScript is bogging down Selenium for Python

So I want to scrape a website that uses JavaScript/AJAX to load additional results as you scroll down the page. I am using Python 3.7 with Selenium driving headless Chrome. However, as scraping progresses, the page source keeps expanding, which slows my machine down until it is at a standstill. Even simple operations like –

code = driver.page_source

– grow to take several seconds. I ran a test to see how much the page source had grown: after a few hundred results it had expanded from an initial length of about half a million characters to 25 million characters, a 50-fold increase! My questions are:

1) Is there some way to have Selenium delete previously loaded elements (similar to the way you can delete them in Chrome's "inspect element" mode) to keep the size manageable? A sketch of the kind of thing I mean follows after these questions.

2) Or is there some other simple solution that I'm overlooking?
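
For concreteness, here is a rough sketch of the kind of pruning I mean in question 1. It assumes headless Chrome and a hypothetical ".result" CSS selector for the items being scraped (the real selector and URL would depend on the site):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("https://example.com/results")  # placeholder URL

# ... scroll, wait for new results, scrape the visible items here ...

# Remove nodes that have already been scraped so page_source stays small,
# keeping the last few so the site's own scroll handler still has an anchor.
driver.execute_script("""
    const items = document.querySelectorAll('.result');
    for (let i = 0; i < items.length - 5; i++) {
        items[i].remove();
    }
""")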

Upvotes: 0

Views: 37

Answers (1)

pbuck

Reputation: 4561

One suggestion would be to look at the JavaScript that is being run and do something equivalent in Python, rather than simply relying on Selenium.

I don't know which website you're scraping, but it sounds like it's making a series of AJAX calls, loading page after page of results (images/posts/whatever).

Reverse-engineer the JS: it's probably making the same AJAX call over and over, passing in a parameter or two. Figure out how the JS calculates the passed-in parameter (is it a timestamp, the ID of the "last" element received, etc.).

Then, rather than having Selenium do the work, use Python's requests library to make the equivalent POST. Retrieve the data (likely JSON or HTML), parse it for what you need and then repeat.
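
Purely as an illustration (the endpoint, parameter names and response shape below are invented, since we don't know the actual site), the loop would look something like this:

import requests

session = requests.Session()
url = "https://example.com/api/results"    # hypothetical AJAX endpoint
payload = {"last_id": 0, "page_size": 50}  # hypothetical parameters the page's JS sends

while True:
    # The equivalent of the POST the page fires as you scroll
    resp = session.post(url, data=payload)
    resp.raise_for_status()
    data = resp.json()                     # assuming a JSON response

    items = data.get("results", [])
    if not items:
        break                              # no more pages to fetch

    for item in items:
        # ... parse out whatever fields you need from each item ...
        pass

    # Feed the ID of the last element received back in, as the site's JS would
    payload["last_id"] = items[-1]["id"]

Using a requests.Session also keeps any cookies the site sets, which matters if the endpoint expects the same session the page established.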

Depending on the site you're looking at, this can be orders of magnitude faster.

Upvotes: 1
