Reputation: 1050
I'm updating this tutorial because it's out of date:
http://mherman.org/blog/2012/11/05/scraping-web-pages-with-scrapy/#.VwpeOfl96Ul
It should fetch the link and title of each job listing on Craigslist for NPOs. The link gets fetched, but the title doesn't.
This is the code of the page for this element:
<span class="pl">
  <time datetime="2016-04-09 14:10" title="Sat 09 Apr 02:10:57 PM">Apr 9</time>
  <a href="/nby/npo/5531527495.html" data-id="5531527495" class="hdrlnk">
    <span id="titletextonly">Therapist</span>
  </a>
</span>
This is the code of the crawler:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    titles = hxs.xpath("//span[@class='pl']")
    items = []
    for titles in titles:
        item = CraigslistSampleItem()
        item["title"] = titles.select("a/text()").extract()
        item["link"] = titles.select("a/@href").extract()
        items.append(item)
    return items
If I inspect the element in Chrome and copy its XPath, I get //*[@id='titletextonly'] for the title, but this returns the entire list of titles, not just the one for the current link. (In this case, I should get '/nby/npo/5531527495.html' for the link and 'Therapist' for the title.)
I know the line
item["title"] = titles.select("a/text()").extract()
needs to be updated, but if I enter //*[@id='titletextonly'] I get every single title. I'm close, but I don't know how to write the XPath for 'titletextonly' within the a element that carries the href.
I'm new to Scrapy and XPath, so please be kind in your comments.
Thank you.
Upvotes: 1
Views: 977
Reputation: 101652
a/text() will only select text nodes that are direct children of the a element. The text you want is not a direct child of the a element; it's inside the nested span.
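I haven't used Scrapy, but I suggest trying this:
item["title"] = titles.select("string(a)").extract()
This should return the string value of the a element, which includes all of the text inside it. (Selecting "a" alone and calling extract() returns the element's markup rather than its text.)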
If that doesn't work, you can also try:
item["title"] = titles.select("a//text()").extract()
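To see the difference between direct-child text and descendant text without running a crawl, here is a minimal sketch using Python's standard-library ElementTree instead of Scrapy. It assumes a well-formed copy of the markup from the question (the closing tags are added for illustration), and the variable names are mine, not part of the tutorial.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the listing markup from the question
# (closing tags added so the XML parser accepts it).
ROW = """
<span class="pl">
  <time datetime="2016-04-09 14:10" title="Sat 09 Apr 02:10:57 PM">Apr 9</time>
  <a href="/nby/npo/5531527495.html" data-id="5531527495" class="hdrlnk">
    <span id="titletextonly">Therapist</span>
  </a>
</span>
"""

row = ET.fromstring(ROW)
anchor = row.find("a")

# .text holds only the text directly inside <a> before its first child,
# which here is just whitespace. This mirrors why a/text() finds no title.
direct_text = (anchor.text or "").strip()

# itertext() walks every text node under <a>, roughly what a//text() collects.
descendant_text = "".join(anchor.itertext()).strip()

print(repr(direct_text))
print(repr(descendant_text))
```

The first value comes out empty and the second contains the title, which is the same reason a/text() misses it while a//text() picks it up.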
Upvotes: 1
Reputation: 758
Change the XPath as below so that it descends into the span tag inside the link:
item["title"] = titles.select("a/span/text()").extract()
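The reason this works where //*[@id='titletextonly'] does not is that the path is evaluated relative to each row rather than against the whole document. A rough sketch of that idea, using the standard-library ElementTree as a stand-in for Scrapy selectors; the second row (its link and title) is made up for illustration, and the wrapping div is an assumption so the snippet parses as one document:

```python
import xml.etree.ElementTree as ET

# Two rows shaped like the markup in the question.
# The second row's href and title are hypothetical.
PAGE = """
<div>
  <span class="pl">
    <a href="/nby/npo/5531527495.html" class="hdrlnk">
      <span id="titletextonly">Therapist</span>
    </a>
  </span>
  <span class="pl">
    <a href="/nby/npo/0000000000.html" class="hdrlnk">
      <span id="titletextonly">Volunteer Coordinator</span>
    </a>
  </span>
</div>
"""

root = ET.fromstring(PAGE)

# Document-wide search: matches the title span of every row at once.
all_titles = [s.text for s in root.findall(".//span[@id='titletextonly']")]

# Row-relative search: each lookup starts from the current row,
# so every item gets only its own link and title.
items = []
for row in root.findall(".//span[@class='pl']"):
    link = row.find("a")
    items.append({
        "link": link.get("href"),
        "title": link.findtext("span").strip(),
    })

print(all_titles)
print(items)
```

In Scrapy the same scoping comes from calling the selector method on the row object with a path that does not start with //, as in the a/span/text() line above.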
Upvotes: 1