Reputation: 5
I created a Scrapy spider to extract data from a site, e.g. https://www.sportstoto.com.my/result_print.asp?drawNo=5291/21
Here's my code:
import scrapy
from totoprintasp.items import Result


def generate_start_urls():
    drawNums = ['5291/21']
    return ['https://www.sportstoto.com.my/result_print.asp?drawNo={}'.format(drawNum) for drawNum in drawNums]


class TotoprintSpider(scrapy.Spider):
    name = 'totoprint'
    allowed_domains = ['www.sportstoto.com.my/result_print.asp']
    start_urls = generate_start_urls()
    download_delay = 3

    def parse(self, response):
        # print(response.body)
        items = []
        # print(response.body)
        for each in response.xpath("/html/body/div/center/table/tbody"):
            item = Result()
            drawDate = each.xpath(
                "tr[2]/td/div/table/tbody/tr/td[1]/span/font/b/text()").extract()
            drawNo = each.xpath(
                "tr[2]/td/div/table/tbody/tr/td[2]/span/b/font/text()").extract()
            gameType = each.xpath(
                "tr[4]/td/span/font/text()").extract()
            firstPrize = each.xpath(
                "tr[5]/td/table[1]/tbody/tr[2]/td[1]/span/b/font/text()").extract()
            item['drawDate'] = drawDate
            item['drawNo'] = drawNo
            item['gameType'] = gameType
            item['firstPrize'] = firstPrize
            items.append(item)
            yield item
It doesn't extract anything. I am running the command
scrapy runspider totoprint.py
and have set the values
FEED_URI = 'results.json'
FEED_FORMAT = 'json'
in my settings.py file, so the results should be written to the JSON file.
Oddly, nothing appears and nothing gets extracted. I've tried different variations and even changed .extract() to .get().
The XPath works when I try it in the XPath Helper extension in Chrome.
I'd appreciate any help or suggestions.
Upvotes: 0
Views: 42
Reputation: 716
I rewrote your script, but you will need to adapt it to your own item. The problem is that your XPath looks for a single tbody and follows one child path from it, but the page contains a lot of tbody elements.
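To illustrate why an absolute path through tbody can come back empty while a class-based lookup still matches, here is a minimal sketch. It uses Python's standard-library ElementTree instead of Scrapy so that it runs on its own, and the markup is invented for illustration; only the class name dataDD is taken from the real page. It also assumes the raw response has no tbody in that position, which is common because browsers add tbody elements when they build the DOM, but I have not verified it for this site.

```python
import xml.etree.ElementTree as ET

# Invented, simplified markup: a table served without <tbody>.
# This is an assumption for illustration, not the site's real HTML.
RAW_HTML = """
<html><body><div><center>
<table>
  <tr><td><span class="dataDD">Date:30/05/2021</span></td>
      <td><span class="dataDD">DrawNo. 5291/21</span></td></tr>
</table>
</center></div></body></html>
"""

root = ET.fromstring(RAW_HTML)

# Absolute path that expects <tbody>: nothing matches in the raw markup.
by_position = root.findall("./body/div/center/table/tbody/tr/td/span")

# Class-based lookup: matches wherever the elements sit in the tree.
by_class = [el.text for el in root.findall(".//*[@class='dataDD']")]

print(len(by_position))  # 0
print(by_class)          # ['Date:30/05/2021', 'DrawNo. 5291/21']
```

A quick way to check the real page is to uncomment print(response.body) in your spider and compare it with what the browser's inspector shows.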
As I understand it, you want gameType as a list and the other fields as strings. I get the following output:
|------------------|-----------------|----------------------------------------|------------|
| drawDate | drawNo | gameType | firstPrize |
|------------------|-----------------|----------------------------------------|------------|
| Date:30/05/2021 | DrawNo. 5291/21 | TOTO 4D,TOTO 4D ZODIAC,TOTO 5D,TOTO 6D | 4800 |
|------------------|-----------------|----------------------------------------|------------|
By the way, you don't need a for loop for each URL. Scrapy calls parse once for each URL in start_urls. Here is the script:
import scrapy


def generate_start_urls():
    drawNums = ['5291/21']
    return ['https://www.sportstoto.com.my/result_print.asp?drawNo={}'.format(drawNum) for drawNum in drawNums]


class TotoprintSpider(scrapy.Spider):
    name = 'totoprint'
    # allowed_domains takes bare domain names, not URLs or paths.
    allowed_domains = ['www.sportstoto.com.my']
    start_urls = generate_start_urls()
    download_delay = 3
    custom_settings = {
        # Turn off robots.txt compliance, because the site's robots.txt does not let the spider in.
        "ROBOTSTXT_OBEY": False,
    }

    def parse(self, response):
        # Both values share the same class, so they can be scraped together.
        drawDate, drawNo = response.xpath('//*[@class="dataDD"]//text()').extract()
        gameType = response.xpath('//*[@class="tit4D"]//text()').extract()
        # Your script only wants the first prize, hence the [1] in the XPath.
        firstPrize = response.xpath('(//*[@class="dataResultA"])[1]//text()').get()
        yield {
            # The text contains stray tabs, newlines and carriage returns; remove them with replace().
            "drawDate": drawDate.replace("\t", "").replace("\n", "").replace("\r", ""),
            "drawNo": drawNo.replace("\t", "").replace("\n", "").replace("\r", ""),
            "gameType": gameType,
            "firstPrize": firstPrize,
        }
I think this script does what you want.
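If the chained replace() calls feel clumsy, the cleanup could be moved into a small helper. This is only a sketch: clean is a hypothetical name, and it behaves slightly differently from the chained calls, because it also trims the ends and collapses repeated spaces into one instead of only deleting tabs, newlines and carriage returns.

```python
def clean(text):
    """Collapse runs of whitespace (tabs, newlines, spaces) into single spaces."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces.
    return " ".join(text.split())


# Made-up sample strings resembling the padded text nodes on the page.
raw_date = "\r\n\t\tDate:30/05/2021\r\n\t"
raw_draw = "\r\n\tDrawNo. 5291/21\t\r\n"

print(clean(raw_date))  # Date:30/05/2021
print(clean(raw_draw))  # DrawNo. 5291/21
```

In the spider you would then write "drawDate": clean(drawDate) and "drawNo": clean(drawNo).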
Upvotes: 1