Tensigh

Reputation: 1050

How to get an item using XPath in Scrapy

I'm updating this tutorial because it's out of date:
http://mherman.org/blog/2012/11/05/scraping-web-pages-with-scrapy/#.VwpeOfl96Ul

It should fetch the link and title of each job listing on Craigslist for NPOs. The link gets fetched, but the title doesn't.

This is the page's HTML for this element:

<span class="pl"> 
  <time datetime="2016-04-09 14:10" title="Sat 09 Apr 02:10:57 PM">Apr 9</time> 
  <a href="/nby/npo/5531527495.html" data-id="5531527495" class="hdrlnk">
  <span id="titletextonly">Therapist</span>

This is the code of the crawler:

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.xpath("//span[@class='pl']")
        items = []
        for titles in titles:
            item = CraigslistSampleItem()
            item["title"] = titles.select("a/text()").extract()
            item["link"] = titles.select("a/@href").extract()
            items.append(item)
        return items

If I inspect the element in Chrome and copy its XPath, I get //*[@id='titletextonly'] for the title, but that returns every title on the page, not just the one that belongs to this link. In this case I should get '/nby/npo/5531527495.html' for the link and 'Therapist' for the title.

I know the line

item["title"] = titles.select("a/text()").extract()

needs to be updated, but if I use //*[@id='titletextonly'] I get every single title. So I'm close, but I don't know how to write the XPath for 'titletextonly' relative to the a element that carries the href.

I'm new to Scrapy and XPath, so please be kind in your comments.

Thank you.

Upvotes: 1

Views: 977

Answers (2)

JLRishe

Reputation: 101652

a/text() only selects text nodes that are direct children of the a element. The text you want is not a direct child of the a element; it is inside the nested span.

I haven't used Scrapy, but I suggest trying this:

item["title"] = titles.select("a").extract()

This should get the string value of the a element, which would include all of the text inside it.

If that doesn't work, you can also try:

item["title"] = titles.select("a//text()").extract()
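The difference between direct text children and descendant text can be sketched outside Scrapy. The example below is only an illustration: it uses Python's standard-library xml.etree.ElementTree instead of Scrapy selectors, and it assumes the question's snippet with closing tags added so that it parses as XML. One caveat for the first suggestion: in Scrapy, extract() on an element selector returns the element's serialized markup rather than its string value, so a//text() (or the XPath string(a)) is the more likely fit there.

```python
import xml.etree.ElementTree as ET

# The question's snippet, with closing tags added (an assumption) so it parses as XML.
SNIPPET = """
<span class="pl">
  <time datetime="2016-04-09 14:10" title="Sat 09 Apr 02:10:57 PM">Apr 9</time>
  <a href="/nby/npo/5531527495.html" data-id="5531527495" class="hdrlnk">
    <span id="titletextonly">Therapist</span>
  </a>
</span>
"""

row = ET.fromstring(SNIPPET)
link = row.find("a")

# Text sitting directly inside <a> before its first child: only whitespace here.
direct_text = (link.text or "").strip()

# All text anywhere below <a>, roughly what the XPath a//text() collects.
descendant_text = "".join(link.itertext()).strip()

# Text of the nested <span>, roughly what a/span/text() selects.
span_text = link.find("span").text

print(repr(direct_text), repr(descendant_text), repr(span_text))
```

The direct text of the a element is empty apart from whitespace, which is why a/text() does not produce the title.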

Upvotes: 1

Srikanth Nakka

Reputation: 758

Change the XPath as below so that it descends into the span tag.

item["title"] = titles.select("a/span/text()").extract()
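This works because a/span/text() is a relative path, so it is evaluated against the current row only. In Scrapy, an XPath that starts with // searches the whole document even when it is called on a row selector, which is why //*[@id='titletextonly'] returned every title; prefixing it with a dot (.//) keeps it inside the row. The sketch below illustrates the same per-row idea with the standard-library xml.etree.ElementTree rather than Scrapy; the wrapping div and the second row (its link and title) are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Two listing rows modeled on the question's markup. The wrapping <div> and the
# second row's link and title are hypothetical, added only for illustration.
PAGE = """
<div>
  <span class="pl">
    <a href="/nby/npo/5531527495.html" class="hdrlnk"><span id="titletextonly">Therapist</span></a>
  </span>
  <span class="pl">
    <a href="/nby/npo/0000000000.html" class="hdrlnk"><span id="titletextonly">Example Title</span></a>
  </span>
</div>
"""

root = ET.fromstring(PAGE)

# Searching from the document root finds every title on the page.
all_titles = [e.text for e in root.findall(".//*[@id='titletextonly']")]

# Searching relative to each row pairs one title with one link.
items = []
for row in root.findall(".//span[@class='pl']"):
    link = row.find("a")
    items.append({"title": row.find("a/span").text, "link": link.get("href")})

print(all_titles)
print(items)
```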

Upvotes: 1
