How to scrape the links hidden in the dropdown menu using scrapy?

Question

I want to scrape NFL depth charts from pages such as https://www.ourlads.com/nfldepthcharts/archive/220/BUF . To do that I need the links to all pages of this kind, but the source code of the "Archive Dates" dropdown menu does not include any links.

I read the post "Web scrape get drop-down menu data python", which I think is relevant, since the answer there points out that the web page uses JavaScript.

However, that answer uses Selenium. I wonder whether I can solve the problem using only Scrapy or BeautifulSoup.

The following is the structure of my scraper.

import scrapy
from bs4 import BeautifulSoup


class depth_chart_archive_spider(scrapy.Spider):
    name = "depth_chart_archive"
    start_urls = ('https://www.ourlads.com/nfldepthcharts/',)

    def parse(self, response):
        dchome = BeautifulSoup(response.body, 'html.parser')

        # get the links somehow
        links = []  # placeholder: this is the part I don't know how to fill in

        for link in links:
            yield scrapy.Request(link, callback=self.parse_team)

    def parse_team(self, response):
        # parse the page
        pass

Marcos · Accepted Answer

You can build the URL from the value attribute of each option tag.

For example, the menu entry for 05/01/2019 has value=220 in its option tag.

The URL opened when you click that entry is:

https://www.ourlads.com/nfldepthcharts/archive/220/BUF

So the URLs follow a pattern, and you can request all of them with something like:

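# inside parse() of the spider from the question
site_url = 'https://www.ourlads.com/nfldepthcharts/archive/{code}/BUF'

# every numeric option value becomes one archive URL
for code in response.xpath('//option/@value').re(r'\d+'):
    yield scrapy.Request(site_url.format(code=code), callback=self.parse_team)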

The regex is there only to skip the first item in the menu, whose value is not numeric.
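
If you want to check the URL-building step without hitting the site, it can be pulled out into a small plain-Python helper. This is only a sketch: the name build_archive_urls and the team parameter are my own additions, not something from the site or from Scrapy, and the check is slightly stricter than the regex above, since it keeps a value only when the whole value is digits.

```python
import re

# URL pattern from the answer above; the team code ("BUF") is made a parameter.
ARCHIVE_URL = "https://www.ourlads.com/nfldepthcharts/archive/{code}/{team}"


def build_archive_urls(option_values, team="BUF"):
    """Return one archive URL per purely numeric option value (hypothetical helper)."""
    urls = []
    for value in option_values:
        value = value.strip()
        # Skip placeholder entries whose value is not a plain number.
        if re.fullmatch(r"\d+", value):
            urls.append(ARCHIVE_URL.format(code=value, team=team))
    return urls


if __name__ == "__main__":
    # Made-up option values for illustration; only "220" appears in the question.
    sample = ["", "220", "219"]
    for url in build_archive_urls(sample):
        print(url)
```

Inside parse() you could then pass it response.xpath('//option/@value').getall() and yield a scrapy.Request for each returned URL. Note that //option matches every dropdown on the page, so if there is more than one select element you may need a narrower XPath.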
