Reputation: 95
I am going to scrape NFL depth charts from web pages such as https://www.ourlads.com/nfldepthcharts/archive/220/BUF . Now I want to get all the links to pages of this kind, but the source of the "Archive Dates" dropdown menu does not include any links:
<option value="">-- Archive Dates --</option>
<option value="220">05/01/2019</option>
<option value="219">04/01/2019</option>
<option value="218">03/01/2019</option>
<option value="217">02/01/2019</option>
<option value="216">01/01/2019</option>
<option value="215">12/01/2018</option>
<option value="214">11/01/2018</option>
<option value="213">10/01/2018</option>
<option value="212">09/01/2018</option>
<option value="211">08/01/2018</option>
I read the post Web scrape get drop-down menu data python, which I think is helpful since it indicates that the page uses JavaScript.
But that answer uses Selenium, and I wonder whether I can solve the problem with Scrapy or BeautifulSoup instead.
The following is the structure of my scraper.
import scrapy
from bs4 import BeautifulSoup


class depth_chart_archive_spider(scrapy.Spider):
    name = "depth_chart_archive"
    start_urls = ('https://www.ourlads.com/nfldepthcharts/',)

    def parse(self, response):
        dchome = BeautifulSoup(response.body, 'html.parser')
        # get the links somehow
        for link in links:
            yield scrapy.Request(link, callback=self.parse_team)

    def parse_team(self, response):
        # parse the page
        pass
Upvotes: 1
Views: 1646
Reputation: 675
You can build the URL using the value attribute found on each option tag.
For example, the entry that refers to 05/01/2019 has value="220" in its option tag:
<option value="220">05/01/2019</option>
The URL opened when you click this entry is:
https://www.ourlads.com/nfldepthcharts/archive/220/BUF
So the URLs follow a pattern, and you can request all of the items with something like:
from scrapy import Request

site_url = 'https://www.ourlads.com/nfldepthcharts/archive/{code}/BUF'
for code in response.xpath('//option/@value').re(r'\d+'):
    yield Request(site_url.format(code=code))
The regex is there only to avoid requesting the first item, <option value="">-- Archive Dates --</option>, whose value is empty.
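If it helps, here is a rough sketch of how this might slot into the spider skeleton from your question. It is only a sketch under a few assumptions: the archive_url template and the hard-coded BUF team code are taken from your example URL, and parse_team is left as a stub. Since the option values are present in the initial HTML response (as your snippet shows), no Selenium or BeautifulSoup is strictly needed; Scrapy's own selectors are enough.

import scrapy


class depth_chart_archive_spider(scrapy.Spider):
    name = "depth_chart_archive"
    start_urls = ['https://www.ourlads.com/nfldepthcharts/']

    # assumption: only the BUF team, as in the example URL from the question
    archive_url = 'https://www.ourlads.com/nfldepthcharts/archive/{code}/BUF'

    def parse(self, response):
        # the option values are in the initial HTML, so no JavaScript rendering is needed;
        # the regex skips the empty "-- Archive Dates --" placeholder option
        for code in response.xpath('//option/@value').re(r'\d+'):
            yield scrapy.Request(self.archive_url.format(code=code),
                                 callback=self.parse_team)

    def parse_team(self, response):
        # parse the depth chart page here (left as in the question)
        pass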
Upvotes: 2