Qing Yong

Reputation: 127

Trying to crawl values using scrapy

I am trying to crawl the 'Median Gross Monthly Income From Work' from a webpage using the following code:

class crawl_income(scrapy.Spider):

     name = "salary"
     allowed_domains = ["stats.mom.gov.sg"]
     url = 'http://stats.mom.gov.sg/Pages/Income-Summary-Table.aspx'

     def parse_data(self, response):
         table_headers = response.xpath('//tr[@class="odd"]/td/td')
         salary = []
         for value in table_headers:
             data = value.xpath('.//text()').extract()
             salary.append(data)
         print salary

process = CrawlerProcess()
process.crawl(crawl_income)

process.start()

But when I print the list I created to store the values, it is empty.

Where did I go wrong?

Upvotes: 1

Views: 218

Answers (1)

GHajba

Reputation: 3691

First of all, your code won't work as written.

url should be start_urls to let Scrapy know where to start crawling.

parse_data should be parse: unless a request names a callback explicitly, Scrapy does not know which method to call and falls back to the default, parse. If the parse method is not present, you also get a NotImplementedError when Scrapy crawls the start URL.
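The two points above can be illustrated with a toy model. This is only a sketch: MiniSpider and its run() driver are hypothetical stand-ins written for this example, not Scrapy's real classes, and no network requests are made. It assumes nothing beyond the behaviour described above, namely that only start_urls is read and only parse is called by default.

```python
# Toy model of the dispatch described above. MiniSpider is a hypothetical
# stand-in for illustration, not Scrapy's real Spider class.
class MiniSpider:
    start_urls = []

    def parse(self, response):
        # The default callback is unimplemented until a subclass defines it.
        raise NotImplementedError(
            "{}.parse callback is not defined".format(type(self).__name__)
        )

    def run(self):
        # Hypothetical driver: pretend to fetch each start URL and pass the
        # result to parse(). Attributes named anything else are never read.
        return [self.parse("<response for {}>".format(u)) for u in self.start_urls]


class Broken(MiniSpider):
    url = "http://example.com/"        # ignored: the driver reads start_urls only

    def parse_data(self, response):    # ignored: the driver calls parse() only
        return response


class HalfFixed(MiniSpider):
    start_urls = ["http://example.com/"]

    def parse_data(self, response):    # still never called
        return response


class Fixed(MiniSpider):
    start_urls = ["http://example.com/"]

    def parse(self, response):
        return response


broken_results = Broken().run()    # no start URLs, so nothing is parsed
fixed_results = Fixed().run()      # one start URL, handled by parse()

try:
    HalfFixed().run()
    half_fixed_error = None
except NotImplementedError as exc:
    half_fixed_error = str(exc)
```

In this model, Broken silently produces nothing, HalfFixed fails with NotImplementedError, and only Fixed returns a result, which mirrors the two renames suggested above.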

When I ran the code below (which includes all the changes mentioned) and printed response.body to the console, I did not find any element with class="odd", so I guessed that the site makes AJAX/XHR calls which then provide the information.

EDIT

After looking at your code again, I see that the XPath is a bit odd. You use tr[@class="odd"]/td/td, but a td element does not have another td as its child, so the expression matches nothing. If you want to skip the headers, change your extraction as in the code below. With this change I get results in the salary list.

import scrapy
from scrapy.crawler import CrawlerProcess

class crawl_income(scrapy.Spider):

    name = "salary"
    allowed_domains = ["stats.mom.gov.sg"]
    start_urls = ['http://stats.mom.gov.sg/Pages/Income-Summary-Table.aspx']

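    def parse(self, response):
        # Dump the raw HTML to check that the rows are in the response.
        print(response.body)
        # Select every td anywhere under the rows with class "odd".
        table_headers = response.xpath('//tr[@class="odd"]//td')
        salary = []
        # Skip the first cell, which holds the header.
        for value in table_headers[1:]:
            data = value.xpath('./text()').extract()
            salary.append(data)
        print(salary)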

process = CrawlerProcess()
process.crawl(crawl_income)

process.start()
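The difference between the two XPath expressions discussed in the edit can be checked offline. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors, because its limited XPath subset happens to cover both expressions. The table markup and cell values are made up for illustration and are not taken from the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified table; the values are placeholders, not real data.
SAMPLE = """
<table>
  <tr class="even"><td>Header A</td><td>Header B</td><td>Header C</td></tr>
  <tr class="odd"><td>Row label</td><td>1000</td><td>2000</td></tr>
</table>
"""

root = ET.fromstring(SAMPLE)

# Original expression: a td that is a direct child of another td.
# No such element exists, so the result is empty.
nested = root.findall(".//tr[@class='odd']/td/td")

# Corrected expression: every td below the matching row.
cells = root.findall(".//tr[@class='odd']//td")

# Skip the first cell (the row label), as table_headers[1:] does above.
values = [td.text for td in cells[1:]]

print(len(nested), values)
```

With Scrapy itself, the same two expressions would be passed to response.xpath(), prefixed with // instead of .// as in the answer's code.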

Upvotes: 3
