Reputation: 71
I've been trying to find a syntax reference to help me finish a basic screen scraper built with Scrapy against a Craigslist jobs page. This is just practice while I learn Scrapy, before I move on to more complex projects: jumping between pages, filling out search forms, etc.
This is what my code looks like for Scrapy:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem

class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["craigslist.org"]
    start_urls = ["https://gainesville.craigslist.org/search/jjj"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//p")
        items = []
        for titles in titles:
            item = CraigslistSampleItem()
            item["title"] = titles.select("").extract()
            item["link"] = titles.select("a/@href").extract()
            items.append(item)
        return items
As you can see, I have a Craigslist sample items Python file that defines the fields for title and link. What I can't figure out is the empty XPath on the item["title"] line: the expression for the element I am trying to grab, which is the title of the Craigslist posting.
My XPath for the link works; the link is under
<p><a href='URL'>
in the craigslist posting.
The craigslist posting title is under:
<p><span><span><a class="hdrlnk">Example Job Description</a>
I've messed around with it and have been able to get outputs such as 'hdrlnk' and '1' for the title, but not the title text itself. I'm not sure what I'm doing wrong. Any help would be greatly appreciated!
As a bonus, does anyone know how I would then tell Scrapy to go to the next page and run the same script?
Thanks!
Upvotes: 0
Views: 256
Reputation: 1712
You can try this:
def parse(self, response):
    BASE_URL = 'https://gainesville.craigslist.org'
    # select each posting row
    titles = response.xpath('//p[@class="row"]')
    for title in titles:
        # extracting the title
        name = title.xpath('.//a[@class="hdrlnk"]/text()').extract()
        # cleaning the data
        name = name[0].strip() if name else 'N/A'
        link = title.xpath('.//a[@class="hdrlnk"]/@href').extract()
        link = BASE_URL + link[0].strip() if link else 'N/A'
        item = CraigslistSampleItem(title=name, link=link)
        yield item
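If you want to check the row-relative XPath logic without running the spider, here is a minimal sketch that uses only the standard library. The SAMPLE markup is hypothetical, modeled on the structure described in the question, and xml.etree.ElementTree is a stand-in for Scrapy's selectors: it supports only a subset of XPath and needs well-formed markup, so treat it as an illustration rather than what Scrapy does internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modeled on the structure in the question.
SAMPLE = """
<div>
  <p class="row">
    <a href="/jjj/1.html">thumb</a>
    <span><span><a class="hdrlnk" href="/jjj/1.html"> Example Job Description </a></span></span>
  </p>
  <p class="row">
    <span><span>row without a title link</span></span>
  </p>
</div>
"""

def extract_rows(markup):
    """Return one dict per <p class="row">, falling back to 'N/A' when no title link exists."""
    root = ET.fromstring(markup)
    rows = []
    for row in root.findall(".//p[@class='row']"):
        # The leading "." keeps the search inside the current row.
        anchor = row.find(".//a[@class='hdrlnk']")
        if anchor is None:
            rows.append({"title": "N/A", "link": "N/A"})
            continue
        rows.append({
            "title": (anchor.text or "").strip(),
            "link": anchor.get("href", "N/A"),
        })
    return rows

rows = extract_rows(SAMPLE)
```

The point to notice is the leading dot in the inner expression: it searches inside the current row, so each item gets the title and link from its own posting instead of from the whole page.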
If you also want pagination, the complete code will look like this:
from scrapy import Request

def parse(self, response):
    BASE_URL = 'https://gainesville.craigslist.org'
    titles = response.xpath('//p[@class="row"]')
    for title in titles:
        # extracting the title
        name = title.xpath('.//a[@class="hdrlnk"]/text()').extract()
        # cleaning the data
        name = name[0].strip() if name else 'N/A'
        link = title.xpath('.//a[@class="hdrlnk"]/@href').extract()
        link = BASE_URL + link[0].strip() if link else 'N/A'
        item = CraigslistSampleItem(title=name, link=link)
        yield item
    # follow the "next" button, if there is one, and parse that page the same way
    next_page = response.xpath('//a[@class="button next"]/@href').extract()
    if next_page:
        next_page_url = BASE_URL + next_page[0].strip()
        yield Request(url=next_page_url, callback=self.parse)
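One caveat with concatenating BASE_URL and the href: it only works while the site returns root-relative links. A slightly more defensive sketch is below, using urljoin from the standard library. The next_page_url helper and the "?s=100" query strings are hypothetical, made up for illustration. Inside a spider, Scrapy's response.urljoin(href) does the same job against the URL of the current response.

```python
from urllib.parse import urljoin

PAGE_URL = "https://gainesville.craigslist.org/search/jjj"

def next_page_url(page_url, hrefs):
    """Build an absolute next-page URL from extracted hrefs, or return None on the last page."""
    if not hrefs:
        return None
    # urljoin resolves relative hrefs and leaves absolute ones untouched.
    return urljoin(page_url, hrefs[0].strip())

# Hypothetical hrefs, as they might come back from .extract()
relative = next_page_url(PAGE_URL, ["/search/jjj?s=100"])
absolute = next_page_url(PAGE_URL, ["https://gainesville.craigslist.org/search/jjj?s=200"])
last = next_page_url(PAGE_URL, [])
```

Returning None when the list is empty mirrors the `if next_page:` check above, so the spider simply stops when there is no next button.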
Upvotes: 0