Reputation: 75
I am a Python novice and am trying to write a script to extract the data from this page. Using scrapy, I wrote the following code:
    import scrapy

    class dairySpider(scrapy.Spider):
        name = "dairy_price"

        def start_requests(self):
            urls = [
                'http://www.dairy.com/market-prices/?page=quote&sym=DAH15&mode=i',
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            for rows in response.xpath("//tr"):
                yield {
                    'text': rows.xpath(".//td/text()").extract().strip('. \n'),
                }
However, this didn't scrape anything. Do you have any ideas? Thanks
Upvotes: 1
Views: 2578
Reputation: 391
The table on the page http://www.dairy.com/market-prices/?page=quote&sym=DAH15&mode=i is added to the DOM dynamically, by a request to http://shared.websol.barchart.com/quotes/quote.php?page=quote&sym=DAH15&mode=i&domain=blimling&display_ice=&enabled_ice_exchanges=&tz=0&ed=0.
You should be scraping the second link instead of the first, because scrapy.Request only returns the HTML source and not the content that JavaScript adds afterwards.
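To make the relationship between the two URLs concrete, here is a small sketch using only the standard library. It assumes the data endpoint simply reuses the page's query string and appends the extra parameters seen in the request above. `quote_endpoint_url` is a hypothetical helper name, and the extra parameter values are copied from this one observed request, not taken from any documented API.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

# Endpoint observed in the request quoted above (assumption: it stays stable).
ENDPOINT = "http://shared.websol.barchart.com/quotes/quote.php"

# Extra parameters copied verbatim from that observed request.
EXTRA_PARAMS = [
    ("domain", "blimling"),
    ("display_ice", ""),
    ("enabled_ice_exchanges", ""),
    ("tz", "0"),
    ("ed", "0"),
]


def quote_endpoint_url(page_url):
    """Hypothetical helper: map the visible page URL to the data URL."""
    # Keep the page's own query parameters (page, sym, mode) in order.
    query = parse_qsl(urlsplit(page_url).query, keep_blank_values=True)
    return ENDPOINT + "?" + urlencode(query + EXTRA_PARAMS)


page = "http://www.dairy.com/market-prices/?page=quote&sym=DAH15&mode=i"
print(quote_endpoint_url(page))
```

If other symbols follow the same pattern, changing `sym` in the page URL would give the matching data URL, but that is untested here.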
UPDATE
Here is the working code for extracting table data
    import scrapy

    class dairySpider(scrapy.Spider):
        name = "dairy_price"

        def start_requests(self):
            # Request the endpoint that serves the table, not the wrapper page.
            urls = [
                "http://shared.websol.barchart.com/quotes/quote.php?page=quote&sym=DAH15&mode=i&domain=blimling&display_ice=&enabled_ice_exchanges=&tz=0&ed=0",
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            # One list of text nodes per table row.
            for row in response.css(".bcQuoteTable tbody tr"):
                print(row.xpath("td//text()").extract())  # print() call for Python 3
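Note that the code in the question called `.strip()` directly on the result of `.extract()`, which is a list, so that call raises `AttributeError`. The cleanup has to happen per string. A minimal sketch follows; `clean_cells` is a hypothetical helper, and the sample row is made up for illustration rather than taken from the page.

```python
def clean_cells(cells):
    """Strip whitespace from each extracted string and drop empty ones."""
    stripped = (cell.strip() for cell in cells)
    return [cell for cell in stripped if cell]


# Made-up sample resembling what row.xpath("td//text()").extract() might return.
sample_row = ["\n  Last ", "\n", " 15.62\n"]
print(clean_cells(sample_row))  # ['Last', '15.62']
```

Inside `parse`, the same helper could wrap the `extract()` result before printing or yielding it.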
Make sure you edit your settings.py file and change ROBOTSTXT_OBEY = True to ROBOTSTXT_OBEY = False. With the setting left at True, Scrapy skips any request that the host's robots.txt disallows.
Upvotes: 1