How to get item using XPath in scrapy

Question

I'm updating this tutorial because it's out of date:
http://mherman.org/blog/2012/11/05/scraping-web-pages-with-scrapy/#.VwpeOfl96Ul

It should fetch the link and title of each job listing on Craigslist for NPOs. The link gets fetched, but the title doesn't.

This is the code of the page for this element:

 
  Apr 9 
  
  Therapist

This is the code of the crawler:

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.xpath("//span[@class='pl']")
        items = []
        for titles in titles:
            item = CraigslistSampleItem()
            item["title"] = titles.select("a/text()").extract()
            item["link"] = titles.select("a/@href").extract()
            items.append(item)
        return items

If I inspect the element in Chrome and copy its XPath, I get //*[@id='titletextonly'] for the title, but this returns the entire list of titles, not just the one that belongs to the link. In this case I should get '/nby/npo/5531527495.html' for the link and 'Therapist' for the title.

I know the line

    item["title"] = titles.select("a/text()").extract()

needs to be updated. If I use //*[@id='titletextonly'] instead, I get every single title, so I'm close, but I don't know how to write the XPath for 'titletextonly' inside the link element.

I'm new to Scrapy and XPath, so please be kind in your comments.

Thank you.

Srikanth Nakka · Accepted Answer

Change the XPath as below to traverse up to the 'span' tag. The title text sits in a span nested inside the link, so a/text() does not reach it.

    item["title"] = titles.select("a/span/text()").extract()
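To see why the relative path works while the XPath copied from Chrome returns every title, here is a minimal sketch. It is not Scrapy code: it uses the standard library's xml.etree.ElementTree as a stand-in so it runs without installing anything, and ElementTree's XPath subset has no text(), so .text is read instead. The markup, the second listing, and the helper names parse_rows and all_titles are assumptions made up for illustration. In Scrapy the per-row lookup would look roughly like row.xpath("a/span/text()").extract(), and a path starting with // searches from the document root even when it is called on a row selector.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup modelled on the listing in the question. The second
# row ("Case Manager", 5531527496) is invented for illustration only.
PAGE = """
<div>
  <span class="pl">
    <a href="/nby/npo/5531527495.html"><span id="titletextonly">Therapist</span></a>
  </span>
  <span class="pl">
    <a href="/nby/npo/5531527496.html"><span id="titletextonly">Case Manager</span></a>
  </span>
</div>
"""


def parse_rows(markup):
    """Return one {'title', 'link'} dict per <span class='pl'> row."""
    root = ET.fromstring(markup)
    items = []
    for row in root.findall(".//span[@class='pl']"):
        link = row.find("a")        # looked up relative to this row only
        title = row.find("a/span")  # the nested span that holds the text
        items.append({"title": title.text, "link": link.get("href")})
    return items


def all_titles(markup):
    """Search the whole document, as the XPath copied from Chrome does."""
    root = ET.fromstring(markup)
    return [s.text for s in root.findall(".//span[@id='titletextonly']")]


if __name__ == "__main__":
    for item in parse_rows(PAGE):
        print(item)
    print(all_titles(PAGE))
```

The point of the sketch is the difference between the two lookups: parse_rows pairs each title with its own link because both paths start at the current row, while all_titles ignores the row and collects every matching span on the page.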
