Reputation: 336
After having created a few different spiders I thought I could scrape practically anything, but I've hit a roadblock.
Given the following code snippet:
<div class="col-md-4">
<div class="tab-title">Homepage</div>
<p>
<a target="_blank" rel="nofollow"
href="http://www.bitcoin.org">http://www.bitcoin.org
</a>
</p>
</div>
How would you go about selecting the link within the <a ...></a> element based on the text within the tab-title div?
The reason I need that condition is that several other links match this selector:
response.css('div.col-md-4 a::attr(href)').extract()
My best guess is the following:
response.css('div.col-md-4 div.tab-title:contains("Homepage") a::attr(href)').extract()
Any insights are appreciated! Thank you in advance.
Note: I am using Scrapy.
Upvotes: 0
Views: 636
Reputation: 10220
How about this using XPath:
response.xpath('//div[@class="tab-title" and contains(., "Homepage")]/..//a/@href')
Find a div with class tab-title that contains Homepage, then step up to its parent and look for an a descendant at any level.
EDIT: Using CSS, you should be able to do it like this:
response.css('div.tab-title:contains("Homepage") ~ * a::attr(href)')
Upvotes: 2