vcovo

Reputation: 336

Scraping based on "nested property"

After having created a few different spiders I thought I could scrape practically anything, but I've hit a roadblock.

Given the following code snippet:

<div class="col-md-4">
    <div class="tab-title">Homepage</div>
    <p>
        <a target="_blank" rel="nofollow" 
         href="http://www.bitcoin.org">http://www.bitcoin.org
        </a>
    </p>
</div>

How would you go about selecting the link inside the <a>...</a> element based on the text within the tab-title div?

I need that condition because several other links on the page also match this selector:

response.css('div.col-md-4 a::attr(href)').extract()

My best guess is the following:

response.css('div.col-md-4 div.tab-title:contains("Homepage") a::attr(href)').extract()

Any insights are appreciated! Thank you in advance.

Note: I am using Scrapy.

Upvotes: 0

Views: 636

Answers (1)

Tomáš Linhart

Reputation: 10220

How about this using XPath:

response.xpath('//div[@class="tab-title" and contains(., "Homepage")]/..//a/@href')

Find a div with class tab-title whose text contains Homepage, then step up to its parent and look for an a element at any depth below it.

EDIT: Using CSS, you should be able to do it like this:

response.css('div.tab-title:contains("Homepage") ~ * a::attr(href)')

Upvotes: 2
