Scrape JS generated content with Scrapy and Python

Question

There is a web page which is partially generated with JS: https://www.ncbi.nlm.nih.gov/genome/genomes/971

I want to scrape the links in the FTP column. All of them are generated by JS.

By default, Scrapy fetches only the HTML without executing JS. How can I change that?

Tomáš Linhart · Accepted Answer

If you are about to scrape a page that generates its content dynamically, the first thing to do is to check whether it calls an API. In your browser's developer tools, look for XHR requests in the Network tab. For the page you refer to, I can see a request to

https://www.ncbi.nlm.nih.gov/genomes/Genome2BE/genome2srv.cgi?action=GetGenomes4Grid&genome_id=971&genome_assembly_id=&king=Bacteria&mode=2&flags=1&page=1&pageSize=100

If you look at the response, you'll see that it contains the links shown in the FTP column on the page. You can call this API directly to get the information you need.
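As a rough sketch of how that could look in code, the helpers below rebuild the request URL from its query parameters and pull FTP links out of a response body. The names build_grid_url and extract_ftp_links are hypothetical, and the regular expression assumes the links appear in the response as plain ftp:// URLs, so check the real response before relying on it.

```python
import re
from urllib.parse import urlencode

# Endpoint taken from the XHR request seen in the browser's network tab.
API_ENDPOINT = "https://www.ncbi.nlm.nih.gov/genomes/Genome2BE/genome2srv.cgi"

# Assumption: links show up in the response body as plain ftp:// URLs.
FTP_LINK_RE = re.compile(r"ftp://[^\s\"'<>\\]+")


def build_grid_url(genome_id, page=1, page_size=100):
    """Build the grid API URL for one page of results (hypothetical helper)."""
    params = {
        "action": "GetGenomes4Grid",
        "genome_id": genome_id,
        "genome_assembly_id": "",
        "king": "Bacteria",
        "mode": 2,
        "flags": 1,
        "page": page,
        "pageSize": page_size,
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def extract_ftp_links(body):
    """Return the unique ftp:// URLs found in the body, in order of appearance."""
    seen = set()
    links = []
    for link in FTP_LINK_RE.findall(body):
        if link not in seen:
            seen.add(link)
            links.append(link)
    return links
```

In a Scrapy spider, start_requests could yield scrapy.Request(build_grid_url(971), callback=self.parse), and the callback could yield one item per link returned by extract_ftp_links(response.text). Increasing the page argument is one way to walk through further result pages, assuming the API paginates the way its parameters suggest.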

If you really want to render the page and scrape it, I suggest you use Splash. The best way to integrate it with Scrapy is the scrapy-splash library.
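For reference, a minimal settings.py fragment for scrapy-splash might look like the following. It follows the setup described in the scrapy-splash README; the SPLASH_URL value is an assumption that a Splash instance is running locally on port 8050.

```python
# settings.py (fragment)

# Assumption: a Splash instance is reachable locally on port 8050.
SPLASH_URL = "http://localhost:8050"

# Downloader middlewares that route requests through Splash.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Spider middleware that deduplicates repeated Splash arguments.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Duplicate filter that takes Splash arguments into account.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```

With these settings in place, a spider can yield scrapy_splash.SplashRequest(url, self.parse, args={"wait": 0.5}) instead of a plain scrapy.Request, and the callback then receives the rendered HTML. The wait value here is only a guess; tune it to how long the page takes to load its data.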
