Alexander Cohen

Reputation: 71

XPath Syntax - Scrapy

I've been looking for an XPath syntax reference so I can finish a basic screen scraper built with Scrapy against a Craigslist jobs page. This is just practice while I learn Scrapy, before I move on to more complex projects: jumping between pages, filling out search forms, etc.

This is what my code looks like for Scrapy:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem

class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["craigslist.org"]
    start_urls = ["https://gainesville.craigslist.org/search/jjj"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//p")
        items = []
        for title in titles:
            item = CraigslistSampleItem()
            item["title"] = title.select("").extract()
            item["link"] = title.select("a/@href").extract()
            items.append(item)
        return items

As you can see, I have a Craigslist sample item Python file that defines the fields for title and link. I can't figure out line 16: the XPath for the element I'm trying to grab, which is the title of the Craigslist posting.

My XPath for the link works; the link is under

 <p><a href='URL'> 

in the craigslist posting.

The craigslist posting title is under:

 <p><span><span><a class="hdrlnk">Example Job Description</a>

I've experimented with it, and the only outputs I've been able to get for the title are hdrlnk and '1'. I'm not sure what I'm doing wrong. Any help would be greatly appreciated!

As a bonus, does anyone know how I would then tell Scrapy to go to the next page and run the same script?

Thanks!

Upvotes: 0

Views: 256

Answers (1)

Jithin

Reputation: 1712

You can try this:

BASE_URL = 'https://gainesville.craigslist.org'
titles = response.xpath('//p[@class="row"]')
for title in titles:
    # extract the title text, searching only inside the current row
    name = title.xpath('.//a[@class="hdrlnk"]/text()').extract()
    # clean the data; fall back to 'N/A' when nothing matched
    name = name[0].strip() if name else 'N/A'
    link = title.xpath('.//a[@class="hdrlnk"]/@href').extract()
    link = BASE_URL + link[0].strip() if link else 'N/A'
    item = CraigslistSampleItem(title=name, link=link)
    yield item
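
To see what the leading `.//` does, here is a minimal sketch that runs the same relative lookup on a hand-written fragment. It uses the standard library's `xml.etree.ElementTree` instead of Scrapy's selectors, so there is no `text()` step and the text is read from `.text`. The sample markup and the `extract_rows` helper are assumptions for illustration, not the real Craigslist page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample modeled on the markup in the question.
SAMPLE = """
<div>
  <p class="row">
    <a href="/jjj/1.html">thumb</a>
    <span><span><a class="hdrlnk" href="/jjj/1.html">Example Job Description</a></span></span>
  </p>
  <p class="row">
    <span><span>no link here</span></span>
  </p>
</div>
"""


def extract_rows(markup):
    """Return one (title, link) pair per <p class="row"> element."""
    root = ET.fromstring(markup)
    rows = []
    for row in root.findall(".//p[@class='row']"):
        # './/' searches only inside the current row, at any depth.
        anchor = row.find(".//a[@class='hdrlnk']")
        if anchor is None:
            # Same fallback as the answer: nothing matched in this row.
            rows.append(("N/A", "N/A"))
            continue
        title = (anchor.text or "").strip() or "N/A"
        link = anchor.get("href", "N/A")
        rows.append((title, link))
    return rows


print(extract_rows(SAMPLE))
```

The first row yields the title and its href even though the `hdrlnk` anchor is nested two spans deep; the second row has no such anchor and falls back to 'N/A'.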

If you want pagination, the complete parse method will look like this:

from scrapy import Request

def parse(self, response):
    BASE_URL = 'https://gainesville.craigslist.org'
    titles = response.xpath('//p[@class="row"]')
    for title in titles:
        # extract the title text, searching only inside the current row
        name = title.xpath('.//a[@class="hdrlnk"]/text()').extract()
        # clean the data; fall back to 'N/A' when nothing matched
        name = name[0].strip() if name else 'N/A'
        link = title.xpath('.//a[@class="hdrlnk"]/@href').extract()
        link = BASE_URL + link[0].strip() if link else 'N/A'
        item = CraigslistSampleItem(title=name, link=link)
        yield item

    # follow the "next" button, if there is one, and parse that page the same way
    next_page = response.xpath('//a[@class="button next"]/@href').extract()
    if next_page:
        next_page_url = BASE_URL + next_page[0].strip()
        yield Request(url=next_page_url, callback=self.parse)
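
Note that `BASE_URL + href` assumes every `href` is site-relative and starts with `/`. A slightly more defensive sketch uses `urllib.parse.urljoin` from the standard library, which also copes with absolute links. The `next_page_url` helper and the sample hrefs below are hypothetical, made up for illustration.

```python
from urllib.parse import urljoin

# Page the spider is currently on (the start URL from the question).
PAGE_URL = "https://gainesville.craigslist.org/search/jjj"


def next_page_url(page_url, hrefs):
    """Resolve the first extracted href against the page URL, or return None."""
    if not hrefs:
        # No "next" link was found: this is the last page.
        return None
    return urljoin(page_url, hrefs[0].strip())


# Hypothetical href values, as .extract() would return them in a list.
print(next_page_url(PAGE_URL, ["/search/jjj?s=100"]))
print(next_page_url(PAGE_URL, ["https://gainesville.craigslist.org/search/jjj?s=200"]))
print(next_page_url(PAGE_URL, []))
```

Inside a spider, `response.urljoin(href)` does the same resolution against the URL of the current response, so you would not need `BASE_URL` at all.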

Upvotes: 0
