Reputation: 127
I am trying to crawl the 'Median Gross Monthly Income From Work' from a webpage using the following code:
import scrapy
from scrapy.crawler import CrawlerProcess

class crawl_income(scrapy.Spider):
    name = "salary"
    allowed_domains = ["stats.mom.gov.sg"]
    url = 'http://stats.mom.gov.sg/Pages/Income-Summary-Table.aspx'

    def parse_data(self, response):
        table_headers = response.xpath('//tr[@class="odd"]/td/td')
        salary = []
        for value in table_headers:
            data = value.xpath('.//text()').extract()
            salary.append(data)
        print salary

process = CrawlerProcess()
process.crawl(crawl_income)
process.start()
But I do not see any values when I try to print out the list I created to store the values.
Where did I go wrong?
Upvotes: 1
Views: 218
Reputation: 3691
First of all, your code won't work as posted. url should be start_urls to let Scrapy know where to start crawling, and parse_data should be parse, because without any other information Scrapy does not know which callback method to call and the default is parse. Otherwise you get a NotImplementedError when Scrapy crawls the start URL and the parse method is not present.
When I run the code below (which includes all the mentioned changes) and print response.body to the console, I do not find any element with class="odd", so I guess there are some AJAX/XHR calls inside the site which provide the information after the page loads.
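One easy way to check this yourself is to write the raw response to a file and open it in an editor or browser; anything the site injects later via JavaScript will be missing from it. A minimal sketch of such a parse method (the file name is just an example):

def parse(self, response):
    # Dump the HTML exactly as Scrapy received it; content that the
    # browser loads later via AJAX/XHR will not appear in this file.
    with open('income_page.html', 'wb') as f:
        f.write(response.body)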
EDIT
After looking at your code again, I see that the XPath is a bit odd: you use tr[@class="odd"]/td/td, but a td element does not have another td as its child. If you want to skip the header cell, change your extraction as in the code below. With this change I get results in the salary list.
import scrapy
from scrapy.crawler import CrawlerProcess

class crawl_income(scrapy.Spider):
    name = "salary"
    allowed_domains = ["stats.mom.gov.sg"]
    start_urls = ['http://stats.mom.gov.sg/Pages/Income-Summary-Table.aspx']

    def parse(self, response):
        print response.body
        # select every td inside the row, then skip the first (header) cell
        table_headers = response.xpath('//tr[@class="odd"]//td')
        salary = []
        for value in table_headers[1:]:
            data = value.xpath('./text()').extract()
            salary.append(data)
        print salary

process = CrawlerProcess()
process.crawl(crawl_income)
process.start()
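To see why the original expression matched nothing, you can test both XPaths against a small hand-written row with Scrapy's Selector (the HTML below is an illustrative sample, not the actual page markup):

from scrapy import Selector

row = '<table><tr class="odd"><td>Total</td><td>3,949</td></tr></table>'
sel = Selector(text=row)

print sel.xpath('//tr[@class="odd"]/td/td').extract()
# [] -- a td never has another td as a child
print sel.xpath('//tr[@class="odd"]//td/text()').extract()
# [u'Total', u'3,949']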
Upvotes: 3