Esu

Reputation: 1

How to crawl and scrape one set of data from multiple linked pages with Scrapy

What I am trying to do is to scrape company information (thisisavailable.eu.pn/company.html) and add all the board members, with their respective data taken from separate pages, to the board dict.

So ideally the data that I get back from sample pages would be:

{
    "company": "Mycompany Ltd",
    "code": "3241234",
    "phone": "2323232",
    "email": "[email protected]",
    "board": {
        "1": {
            "name": "Margaret Sawfish",
            "code": "9999999999"
        },
        "2": {
            "name": "Ralph Pike",
            "code": "222222222"
        }
    }
}

I have searched Google and SO (like here and here, and the Scrapy docs, etc.) but have not been able to find a solution to a problem exactly like this.

What I have been able to cobble together:

items.py:

import scrapy
class company_item(scrapy.Item):
    name = scrapy.Field()
    code = scrapy.Field()
    board = scrapy.Field()
    phone = scrapy.Field()
    email = scrapy.Field()
    pass

class person_item(scrapy.Item):
    name = scrapy.Field()
    code = scrapy.Field()    
    pass

spiders/example.py:

import scrapy
from proov.items import company_item,person_item

class ExampleSpider(scrapy.Spider):
    name = "example"
    #allowed_domains = ["http://thisisavailable.eu.pn"]
    start_urls = ['http://thisisavailable.eu.pn/company.html']

    def parse(self, response):
        if response.xpath("//table[@id='company']"):
            yield self.parse_company(response)
            pass
        elif response.xpath("//table[@id='person']"):
            yield self.parse_person(response)
            pass        
        pass

    def parse_company(self, response):
        Company = company_item()
        Company['name'] = response.xpath("//table[@id='company']/tbody/tr[1]/td[2]/text()").extract_first()
        Company['code'] = response.xpath("//table[@id='company']/tbody/tr[2]/td[2]/text()").extract_first()
        board = []
         
        for person_row in response.xpath("//table[@id='board']/tbody/tr/td[1]"):
            Person = person_item()
            Person['name'] = person_row.xpath("a/text()").extract()
            print (person_row.xpath("a/@href").extract_first())
            request = scrapy.Request('http://thisisavailable.eu.pn/'+person_row.xpath("a/@href").extract_first(), callback=self.parse_person)
            request.meta['Person'] = Person
            return request          
            board.append(Person)

        Company['board'] = board
        return Company      

    def parse_person(self, response):       
        print('PERSON!!!!!!!!!!!')
        print (response.meta)
        Person = response.meta['Person']
        Person['name'] = response.xpath("//table[@id='person']/tbody/tr[1]/td[2]/text()").extract_first()
        Person['code'] = response.xpath("//table[@id='person']/tbody/tr[2]/td[2]/text()").extract_first()
        yield Person

UPDATE: As Rafael noticed, the initial problem was with allowed_domains being wrong - I have commented it out for the time being, and now when I run it, I get (added *'s to URLs due to low rep):

scrapy crawl example
2017-03-07 09:41:12 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: proov)
2017-03-07 09:41:12 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'proov.spiders', 'SPIDER_MODULES': ['proov.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'proov'}
2017-03-07 09:41:12 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.logstats.LogStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.corestats.CoreStats']
2017-03-07 09:41:13 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-03-07 09:41:13 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-03-07 09:41:13 [scrapy.middleware] INFO: Enabled item pipelines: []
2017-03-07 09:41:13 [scrapy.core.engine] INFO: Spider opened
2017-03-07 09:41:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-07 09:41:13 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-03-07 09:41:14 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://*thisisavailable.eu.pn/robots.txt> (referer: None)
2017-03-07 09:41:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://*thisisavailable.eu.pn/scrapy/company.html> (referer: None)
person.html
person2.html
2017-03-07 09:41:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://thisisavailable.eu.pn/person2.html> (referer: http://*thisisavailable.eu.pn/company.html)
PERSON!!!!!!!!!!!
2017-03-07 09:41:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://*thisisavailable.eu.pn/person2.html>
{'code': u'222222222', 'name': u'Kaspar K\xe4nnuotsa'}
2017-03-07 09:41:15 [scrapy.core.engine] INFO: Closing spider (finished)
2017-03-07 09:41:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 936, 'downloader/request_count': 3, 'downloader/request_method_count/GET': 3, 'downloader/response_bytes': 1476, 'downloader/response_count': 3, 'downloader/response_status_count/200': 2, 'downloader/response_status_count/404': 1, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2017, 3, 7, 7, 41, 15, 571000), 'item_scraped_count': 1, 'log_count/DEBUG': 5, 'log_count/INFO': 7, 'request_depth_max': 1, 'response_received_count': 3, 'scheduler/dequeued': 2, 'scheduler/dequeued/memory': 2, 'scheduler/enqueued': 2, 'scheduler/enqueued/memory': 2, 'start_time': datetime.datetime(2017, 3, 7, 7, 41, 13, 404000)}
2017-03-07 09:41:15 [scrapy.core.engine] INFO: Spider closed (finished)

and if run with "-o file.json", the file content is:

[ {"code": "222222222", "name": "Ralph Pike"} ]


So I've gotten a bit further, but I am still at a loss as to how to make it work.

Can somebody help me make this work?

Upvotes: 0

Views: 842

Answers (1)

Rafael Almeida

Reputation: 5240

Your problem isn't related to having multiple items, even though that will become an issue in the future.

Your problem is explained in the output:

[scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'kidplay-wingsuit.c9users.io': <GET http://thisisavailable.eu.pn/scrapy/person2.html>

It means that a request is going to a domain outside of your allowed_domains list.

Your allowed_domains is wrong. It should be:

allowed_domains = ["thisisavailable.eu.pn"]

Note:

Instead of using a different item for Person, just make it a field in Company and assign a dict or list to it while scraping.
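
For illustration, here is a minimal sketch of that single-item approach, assuming the same page layout and XPaths as in the question's spider and the project name proov (both taken from the question, so they may need adjusting). The idea is to collect the board member links first, then chain the person requests one after another via meta, and only yield the company item once the last person page has been parsed:

import scrapy
from proov.items import company_item

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["thisisavailable.eu.pn"]
    start_urls = ['http://thisisavailable.eu.pn/company.html']

    def parse(self, response):
        # Fill in the company fields from the company page.
        company = company_item()
        company['name'] = response.xpath("//table[@id='company']/tbody/tr[1]/td[2]/text()").extract_first()
        company['code'] = response.xpath("//table[@id='company']/tbody/tr[2]/td[2]/text()").extract_first()
        company['board'] = []

        # Collect the links to the individual board member pages.
        person_urls = response.xpath("//table[@id='board']/tbody/tr/td[1]/a/@href").extract()
        if not person_urls:
            yield company
            return

        # Request the first person page and pass the company item plus the
        # remaining URLs along via meta; parse_person chains the rest.
        yield scrapy.Request(
            response.urljoin(person_urls[0]),
            callback=self.parse_person,
            meta={'company': company, 'remaining': person_urls[1:]})

    def parse_person(self, response):
        company = response.meta['company']
        remaining = response.meta['remaining']

        # Append this board member as a plain dict to the company's board list.
        company['board'].append({
            'name': response.xpath("//table[@id='person']/tbody/tr[1]/td[2]/text()").extract_first(),
            'code': response.xpath("//table[@id='person']/tbody/tr[2]/td[2]/text()").extract_first(),
        })

        if remaining:
            # More board members left: chain the next request.
            yield scrapy.Request(
                response.urljoin(remaining[0]),
                callback=self.parse_person,
                meta={'company': company, 'remaining': remaining[1:]})
        else:
            # All board members collected: emit the single company item.
            yield company

Chaining the requests this way keeps one item per company; an alternative is to fire all person requests at once and yield the item when a counter of outstanding requests reaches zero.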

Upvotes: 2
