Scrapy Data Table extract

Question

I am trying to scrape "https://www.expireddomains.net/deleted-com-domains/" for the expired domain data list.

I always get empty item fields with the following spider:

class ExpiredSpider(BaseSpider):

    name = "expired"
    allowed_domains = ["example.com"]
    start_urls = ['https://www.expireddomains.net/deleted-com-domains/']

    def parse(self, response):
        log.msg('parse(%s)' % response.url, level = log.DEBUG)
        rows = response.xpath('//table[@class="base1"]/tbody/tr')
        for row in rows:
            item = DomainItem()
            item['domain'] = row.xpath('td[1]/text()').extract()
            item['bl'] = row.xpath('td[2]/text()').extract()
            yield item

Can somebody point out what is wrong? Thanks.

Rafael Almeida · Accepted Answer

First, you should subclass scrapy.Spider instead of BaseSpider, which is deprecated.

Second, the .extract() method returns a list rather than a single element. Use .extract_first() to get the first match as a single value. This is how the item extraction should look:

item['domain'] = row.xpath('td[1]/text()').extract_first()
item['bl'] = row.xpath('td[2]/text()').extract_first()

Also, you should use the built-in Python logging library instead of log.msg:

import logging
logging.debug("parse("+response.url+")")
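A slightly more idiomatic variant, offered as a sketch, uses a named logger and passes the URL as an argument so the message is only formatted when DEBUG output is enabled. The logger name and the helper function are hypothetical; inside a scrapy.Spider subclass, self.logger can be used in the same way.

```python
import logging

# Hypothetical logger name, chosen for this example only.
logger = logging.getLogger("expired_demo")


def log_parse(url):
    # The %s placeholder is filled in by the logging module, and only
    # if a handler actually emits DEBUG records.
    logger.debug("parse(%s)", url)
```

In a spider, the equivalent call would be self.logger.debug("parse(%s)", response.url).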
