yeopcp

Reputation: 5

Unable to extract data from a website with Scrapy, but the XPath works in the XPath Helper extension

I created a Scrapy spider to extract data from a site, e.g. https://www.sportstoto.com.my/result_print.asp?drawNo=5291/21

Here's my code:

import scrapy
from totoprintasp.items import Result


def generate_start_urls():
    drawNums = ['5291/21']
    return ['https://www.sportstoto.com.my/result_print.asp?drawNo={}'.format(drawNum) for drawNum in drawNums]


class TotoprintSpider(scrapy.Spider):
    name = 'totoprint'
    allowed_domains = ['www.sportstoto.com.my/result_print.asp']
    start_urls = generate_start_urls()
    download_delay = 3

    def parse(self, response):
        # print(response.body)

        items = []
        # print(response.body)
        for each in response.xpath("/html/body/div/center/table/tbody"):
            item = Result()
            drawDate = each.xpath(
                "tr[2]/td/div/table/tbody/tr/td[1]/span/font/b/text()").extract() 
            drawNo = each.xpath(
                "tr[2]/td/div/table/tbody/tr/td[2]/span/b/font/text()").extract()
            gameType = each.xpath(
                "tr[4]/td/span/font/text()").extract()
            firstPrize = each.xpath(
                "tr[5]/td/table[1]/tbody/tr[2]/td[1]/span/b/font/text()").extract()

            item['drawDate'] = drawDate
            item['drawNo'] = drawNo
            item['gameType'] = gameType
            item['firstPrize'] = firstPrize
            items.append(item)
            yield item

It didn't extract anything. I'm running the command scrapy runspider totoprint.py and have set the values

FEED_URI = 'results.json'

FEED_FORMAT = 'json'

in my settings.py file, so the results should be written to the JSON file.

The funny thing is that nothing appears and nothing gets extracted. I've tried different variations, and even changed .extract() to .get().

The XPath itself works: I've tried it in the XPath Helper extension in my Chrome browser.

I'd appreciate any help or suggestions.

Upvotes: 0

Views: 42

Answers (1)

Murat Demir

Reputation: 716

I rewrote your script, but you'll have to adapt it to your own item. The problem is that your XPath looks for a single tbody and one of its children, but the page contains many tbody elements.
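One related pitfall is worth checking, offered here as an assumption rather than something confirmed for this page: browsers insert tbody elements while building the DOM, so an absolute XPath copied from a browser tool may not match the raw HTML that Scrapy downloads. The sketch below uses hypothetical, simplified markup and the standard-library xml.etree.ElementTree as a dependency-free stand-in for response.xpath. It shows a browser-style path matching nothing while a class-based lookup still finds both values.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup (not the real page): the raw source
# contains no <tbody>, although a browser would add one to its DOM.
RAW_HTML = """
<html><body><table>
  <tr>
    <td><span class="dataDD">Date:30/05/2021</span></td>
    <td><span class="dataDD">DrawNo. 5291/21</span></td>
  </tr>
</table></body></html>
""".strip()

root = ET.fromstring(RAW_HTML)  # root is the <html> element

# Path as a browser tool would report it, including the inserted tbody.
browser_style = root.findall("./body/table/tbody/tr/td/span")

# Class-based lookup that does not depend on the exact table nesting.
class_based = root.findall(".//*[@class='dataDD']")

print(len(browser_style))
print([el.text for el in class_based])
```

In a spider, the equivalent check is to run the same expressions against the downloaded response (for example in scrapy shell) rather than against the browser's rendered DOM.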

As I understand it, you want gameType as a list and the other fields as strings. I get the following output:

|------------------|-----------------|----------------------------------------|------------|
| drawDate         | drawNo          | gameType                               | firstPrize |
|------------------|-----------------|----------------------------------------|------------|
| Date:30/05/2021  | DrawNo. 5291/21 | TOTO 4D,TOTO 4D ZODIAC,TOTO 5D,TOTO 6D | 4800       |
|------------------|-----------------|----------------------------------------|------------|

By the way, you don't need a for loop for each URL. Scrapy calls parse once for each URL, one by one. So here is the script:

import scrapy

def generate_start_urls():
    drawNums = ['5291/21']
    return ['https://www.sportstoto.com.my/result_print.asp?drawNo={}'.format(drawNum) for drawNum in drawNums]

class TotoprintSpider(scrapy.Spider):
    name = 'totoprint'
    allowed_domains = ['www.sportstoto.com.my']  # Domain only; allowed_domains should not include a URL path.
    start_urls = generate_start_urls()
    download_delay = 3
    custom_settings = {
        "ROBOTSTXT_OBEY": False,  # Turn off the robots.txt check; otherwise the site's rules do not let the spider in.
    }

    def parse(self, response):
        # Both values share the same class, so they can be scraped together.
        drawDate, drawNo = response.xpath('//*[@class="dataDD"]//text()').extract()
        gameType = response.xpath('//*[@class="tit4D"]//text()').extract()
        # Your script only wants the first prize, hence the [1] in the XPath.
        firstPrize = response.xpath('(//*[@class="dataResultA"])[1]//text()').get()
        yield {
            # The text contains stray \t, \n and \r characters; strip them with replace.
            "drawDate": drawDate.replace("\t", "").replace("\n", "").replace("\r", ""),
            "drawNo": drawNo.replace("\t", "").replace("\n", "").replace("\r", ""),
            "gameType": gameType,
            "firstPrize": firstPrize,
        }

I think this script does what you want.
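As an optional variation on the chained replace calls, here is a minimal sketch of a whitespace-cleaning helper. The names clean_text and non_empty and the sample strings are hypothetical, not part of the original script. Note one difference in behavior: replace deletes tabs, newlines and carriage returns outright, while this helper collapses every run of whitespace into a single space and trims the ends.

```python
def clean_text(raw):
    """Collapse each run of whitespace into one space and trim both ends."""
    return " ".join(raw.split())


def non_empty(texts):
    """Clean every string and drop the ones that end up empty."""
    cleaned = [clean_text(t) for t in texts]
    return [t for t in cleaned if t]


# Hypothetical text nodes, similar in shape to what //text() can return.
sample = ["\r\n\t\tDate:30/05/2021\r\n", "\r\n\t", "DrawNo. 5291/21\t"]

values = non_empty(sample)
print(values)
```

Filtering out whitespace-only nodes before unpacking may also make the two-value assignment to drawDate and drawNo less fragile, since that line raises ValueError whenever the XPath returns anything other than exactly two strings.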

Upvotes: 1
