Nyxynyx

Reputation: 63619

Using XPath with Scrapy

I am new to Scrapy and am trying to get all the URLs of the listings on the page using XPath.

The first XPath works:

sel.xpath('//[contains(@class, "attraction_element")]')

but the second XPath gives an error:

get_parsed_string(snode_attraction, '//[@class="property_title"]/a/@href')

What is wrong and how can we fix it?

Scrapy Code

def clean_parsed_string(string):
    if len(string) > 0:
        ascii_string = string
        if is_ascii(ascii_string) == False:
            ascii_string = unicodedata.normalize('NFKD', ascii_string).encode('ascii', 'ignore')
        return str(ascii_string)
    else:
        return None


def get_parsed_string(selector, xpath):
    return_string = ''
    extracted_list = selector.xpath(xpath).extract()
    if len(extracted_list) > 0:
        raw_string = extracted_list[0].strip()
        if raw_string is not None:
            return_string = htmlparser.unescape(raw_string)
    return return_string


class TripAdvisorSpider(Spider):
    name = 'tripadvisor'

    allowed_domains = ["tripadvisor.com"]
    base_uri = "http://www.tripadvisor.com"
    start_urls = [
        base_uri + '/Attractions-g155032-Activities-c47-t163-Montreal_Quebec.html'
    ]


    # Entry point for BaseSpider
    def parse(self, response):

        tripadvisor_items = []

        sel = Selector(response)
        snode_attractions = sel.xpath('//[contains(@class, "attraction_element")]')

        # Build item index
        for snode_attraction in snode_attractions:
            print clean_parsed_string(get_parsed_string(snode_attraction, '//[@class="property_title"]/a/@href'))

Upvotes: 2

Views: 766

Answers (1)

alecxe

Reputation: 473863

Neither is a valid XPath expression: the // must be followed by a node test, such as a tag name. You can also use the wildcard *, which matches any element:

snode_attractions = sel.xpath('//*[contains(@class, "attraction_element")]')

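Aside from that, your second XPath expression is used inside a loop, so it has to be context-specific and start with a dot. Without the dot, // searches the whole document from the root rather than only the descendants of the current node: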

# Build item index
for snode_attraction in snode_attractions:
    print clean_parsed_string(get_parsed_string(snode_attraction, './/*[@class="property_title"]/a/@href'))
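
To see why the leading dot matters, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree instead of Scrapy. The sample markup, the hrefs helper and the variable names are made up for illustration. ElementTree implements only a subset of XPath (no contains(), no @href as a result step), so it matches the class attribute exactly and reads href with .get(). Searching from the root stands in for what an expression starting with // does in Scrapy:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the listing page.
PAGE = """
<div>
  <div class="attraction_element">
    <div class="property_title"><a href="/Attraction-1.html">One</a></div>
  </div>
  <div class="attraction_element">
    <div class="property_title"><a href="/Attraction-2.html">Two</a></div>
  </div>
</div>
"""

root = ET.fromstring(PAGE)

# Relative path: evaluated against whichever node it is called on.
LINK_PATH = ".//*[@class='property_title']/a"


def hrefs(node, path=LINK_PATH):
    """Return the href of every link found below `node`."""
    return [a.get("href") for a in node.findall(path)]


attractions = root.findall(".//*[@class='attraction_element']")

# Searching from the document root finds every link on the page ...
document_wide = hrefs(root)

# ... while searching from each attraction node finds only its own link.
per_node = [hrefs(node) for node in attractions]

print(document_wide)
print(per_node)
```

In Scrapy the same idea applies to nested selectors: an expression starting with // ignores the node it is called on, so every iteration of the loop would return the first link of the page instead of the link of the current listing.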

Also note that you don't need to instantiate a Selector object; you can use the response.xpath() shortcut directly.


A more concise and, arguably, more readable way to implement the same logic is to use CSS selectors:

snode_attractions = response.css('.attraction_element')
for snode_attraction in snode_attractions:
    print snode_attraction.css('.property_title > a::attr("href")').extract_first()

Upvotes: 3
