Jaffer Wilson

Reputation: 7273

Using Scrapy to scrape data

I am trying to scrape data using Scrapy, but I am having trouble modifying the code. Here is what I have done as an experiment:

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://anon.example.com/']

    def parse(self, response):
        for title in response.css('h2'):
            yield {'Agent-name': title.css('a ::text').extract_first()}

        next_page = response.css('li.col-md-3.ln-t > div.cs-team.team-grid > figure > a ::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

I took the example from scrapy.org and tried to modify it. This code extracts the names of all the agents from the given page.
But I want Scrapy to go into each agent's page and extract that agent's information from there.
Say for example:

Name: name of the agent
Phone: Phone Number
Email: email address
Website: URL of website, etc.

Hope this clarifies my problem. I would like to have a solution for this problem.

Upvotes: 1

Views: 422

Answers (1)

宏杰李

Reputation: 12168

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://anon.example.com']


    # step 1: collect the detail-page URL of each of the 502 agents
    def parse(self, response):
        info_urls = response.xpath('//div[@class="text"]//a/@href').extract()
        for info_url in info_urls:
            yield scrapy.Request(url=info_url, callback=self.parse_info)

    # step 2: visit each detail page and extract the agent's info
    def parse_info(self, response):
        info = {}
        info['name'] = response.xpath('//h2/text()').extract_first()
        info['phone'] = response.xpath('//text()[contains(.,"Phone:")]').extract_first()
        info['email'] = response.xpath('//*[@class="cs-user-info"]/li[1]/text()').extract_first()
        info['website'] = response.xpath('//*[@class="cs-user-info"]/li[2]/a/text()').extract_first()
        yield info  # yield the item so Scrapy can collect or export it

The name can only be found on the detail page, so in the first step we just collect all the detail-page URLs.

Then we visit each URL and extract all the info.

The data may need some clean-up, but the idea is clear.
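For the clean-up, a small post-processing helper could be applied to each scraped dict before yielding it. This is only a sketch: the field names mirror the spider's item, but the raw value formats (a leading "Phone:" label, stray whitespace) are assumptions about what the page returns.

```python
# Hypothetical clean-up helper; the raw formats handled here are assumptions.
def clean_info(info):
    cleaned = {}
    for key, value in info.items():
        if value is None:
            cleaned[key] = None
            continue
        value = value.strip()
        # drop a leading label like "Phone:" when it matches the field name
        label, sep, rest = value.partition(':')
        if sep and label.strip().lower() == key:
            value = rest.strip()
        cleaned[key] = value
    return cleaned

raw = {'name': ' John Doe ', 'phone': 'Phone: 0123 456789',
       'email': None, 'website': 'http://example.com'}
print(clean_info(raw))
```

In `parse_info` you would then `yield clean_info(info)` instead of the raw dict.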

Upvotes: 1
