Reputation: 7273
I am trying to scrape data using Scrapy, but I am having trouble modifying the code. Here is what I have tried as an experiment:
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://anon.example.com/']

    def parse(self, response):
        for title in response.css('h2'):
            yield {'Agent-name': title.css('a ::text').extract_first()}
        next_page = response.css('li.col-md-3 ln-t > div.cs-team team-grid > figure > a ::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
I took the example from the scrapy.org website and tried to modify it. This code extracts the names of all the agents from the given page.
But I want Scrapy to go into each agent's page and extract the agent's information from there.
For example:
Name: name of the agent
Phone: Phone Number
Email: email address
Website: URL of the website, etc.
I hope this clarifies my problem. I would like a solution for it.
Upvotes: 1
Views: 422
Reputation: 12168
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://anon.example.com']

    # get the 502 URLs of the names
    def parse(self, response):
        info_urls = response.xpath('//div[@class="text"]//a/@href').extract()
        for info_url in info_urls:
            # urljoin turns a relative link into an absolute one and leaves absolute links unchanged
            yield scrapy.Request(url=response.urljoin(info_url), callback=self.parse_info)

    # visit each URL and get the info
    def parse_info(self, response):
        info = {}
        info['name'] = response.xpath('//h2/text()').extract_first()
        info['phone'] = response.xpath('//text()[contains(.,"Phone:")]').extract_first()
        info['email'] = response.xpath('//*[@class="cs-user-info"]/li[1]/text()').extract_first()
        info['website'] = response.xpath('//*[@class="cs-user-info"]/li[2]/a/text()').extract_first()
        # yield the item so Scrapy can collect and export it
        yield info
The name can be found on the detail page, so in the first step we just collect all the detail URLs.
Then we visit each URL and extract all the info.
The data may need clean-up, but the idea is clear.
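As for the clean-up, here is a minimal sketch of what it could look like. The helper name clean_info is hypothetical (it is not part of Scrapy), and it assumes the raw values look like "Phone: ..." with stray whitespace around them; I have not checked this against the real pages, so adjust it to whatever the site actually returns.

```python
import re


def clean_info(info):
    """Return a copy of a scraped info dict with tidied string values.

    Hypothetical helper: assumes each value is either None or a string that
    may carry extra whitespace and a leading label such as "Phone:".
    """
    cleaned = {}
    for key, value in info.items():
        if value is None:
            # the XPath matched nothing, keep the gap visible
            cleaned[key] = None
            continue
        # collapse runs of whitespace (including newlines) into single spaces
        value = re.sub(r'\s+', ' ', value).strip()
        # drop a leading label that repeats the field name, e.g. "Phone:"
        if value.lower().startswith(key.lower() + ':'):
            value = value[len(key) + 1:].strip()
        # treat empty strings as missing
        cleaned[key] = value or None
    return cleaned


# made-up sample values, only to show the shape of the input
sample = {'name': '  John  Doe\n', 'phone': 'Phone: 555 0100 ', 'email': None, 'website': ''}
print(clean_info(sample))
```

In the spider you would then write yield clean_info(info) instead of yielding the raw dict.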
Upvotes: 1