Reputation: 11

Scraping website using python & scrapy

I am new to Scrapy (& Python!), and I am trying to scrap the commentary from the Cricinfo website. Here is an example of a webpage: http://www.espncricinfo.com/champions-league-twenty20-2014/engine/match/763595.html?innings=1;view=commentary

I'm interested in scraping the over numbers (0.1 for eg), and the text next to it.

Using Firebug I can see that the xpath of the "0.1" is: /html/body/div[2]/div[3]/div[4]/div[5]/div/div[3]/table/tbody/tr/td[2]/div/table/tbody/tr[2]/td[1]/p

and the text next to it is: /html/body/div[2]/div[3]/div[4]/div[5]/div/div[3]/table/tbody/tr/td[2]/div/table/tbody/tr[2]/td[2]/p

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from crictest.items import CrictestItem

class MySpider(BaseSpider):
    name = "cricinfo"
    allowed_domains = ["espncricinfo.com/"]
    start_urls = ["http://www.espncricinfo.com/champions-league-twenty20-2014/engine/match/763595.html?innings=1;view=commentary/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//html/body/div[2]/div[3]/div[4]/div[5]/div/div[3]/table/tbody/tr/td[2]/div/table/tbody/tr')
        items =[]
        for row in rows:
            item = CrictestItem()
            item['overnum'] = row.select('td[1]/p/text()').extract()
            item['overnumtext'] = row.select('td[2]/p/text()').extract()
            items.append(item)
        return items

I am trying to loop through the rows (/tr) then return td[1]/p/text and then td[2]/p/text My items.py looks like:

import scrapy


class CrictestItem(scrapy.Item):
    overnum = scrapy.Field()
    overnumtext = scrapy.Field()

Using scrapy crawl cricinfo -o items.csv -t csv it just gives me a items.csv file with no data in it at all.

Where am I going wrong? Any help would be greatly appreciated.

Upvotes: 0

Answers (2)

Anandhakumar R

Reputation: 391

You can get the exact result from the following example.

Use the python next sibling to get the appropriate result.

The Html code is:

<div id="provider-region-addresses">
<h3>Contact details</h3>
<h2 class="toggler nohide">Auckland</h2>
    <dl class="clear">
        <dt>More information</dt>
            <dd>North Shore Hospital</dd><dt>Physical address</dt>
                <dd>124 Shakespeare Rd, Takapuna, Auckland 0620</dd><dt>Postal address</dt>
                <dd>Private Bag 93503, Takapuna, Auckland 0740</dd><dt>Postcode</dt>
                <dd>0740</dd><dt>District/town</dt>

                <dd>
                North Shore, Takapuna</dd><dt>Region</dt>
                <dd>Auckland</dd><dt>Phone</dt>
                <dd>(09) 486 8996</dd><dt>Fax</dt>
                <dd>(09) 486 8342</dd><dt>Website</dt>
                <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
    </dl>
    <h2 class="toggler nohide">Auckland</h2>
    <dl class="clear">
        <dt>Physical address</dt>
                <dd>Helensville</dd><dt>Postal address</dt>
                <dd>PO Box 13, Helensville 0840</dd><dt>Postcode</dt>
                <dd>0840</dd><dt>District/town</dt>

                <dd>
                Rodney, Helensville</dd><dt>Region</dt>
                <dd>Auckland</dd><dt>Phone</dt>
                <dd>(09) 420 9450</dd><dt>Fax</dt>
                <dd>(09) 420 7050</dd><dt>Website</dt>
                <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
    </dl>
    <h2 class="toggler nohide">Auckland</h2>
    <dl class="clear">
        <dt>Physical address</dt>
                <dd>Warkworth</dd><dt>Postal address</dt>
                <dd>PO Box 505, Warkworth 0941</dd><dt>Postcode</dt>
                <dd>0941</dd><dt>District/town</dt>

                <dd>
                Rodney, Warkworth</dd><dt>Region</dt>
                <dd>Auckland</dd><dt>Phone</dt>
                <dd>(09) 422 2700</dd><dt>Fax</dt>
                <dd>(09) 422 2709</dd><dt>Website</dt>
                <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
    </dl>
    <h2 class="toggler nohide">Auckland</h2>
    <dl class="clear">
        <dt>More information</dt>
            <dd>Waitakere Hospital</dd><dt>Physical address</dt>
                <dd>55-75 Lincoln Rd, Henderson, Auckland 0610</dd><dt>Postal address</dt>
                <dd>Private Bag 93115, Henderson, Auckland 0650</dd><dt>Postcode</dt>
                <dd>0650</dd><dt>District/town</dt>

                <dd>
                Waitakere, Henderson</dd><dt>Region</dt>
                <dd>Auckland</dd><dt>Phone</dt>
                <dd>(09) 839 0000</dd><dt>Fax</dt>
                <dd>(09) 837 6634</dd><dt>Website</dt>
                <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
    </dl>
    <h2 class="toggler nohide">Auckland</h2>
    <dl class="clear">
        <dt>More information</dt>
            <dd>Hibiscus Coast Community Health Centre</dd><dt>Physical address</dt>
                <dd>136 Whangaparaoa Rd, Red Beach 0932</dd><dt>Postcode</dt>
                <dd>0932</dd><dt>District/town</dt>

                <dd>
                Rodney, Red Beach</dd><dt>Region</dt>
                <dd>Auckland</dd><dt>Phone</dt>
                <dd>(09) 427 0300</dd><dt>Fax</dt>
                <dd>(09) 427 0391</dd><dt>Website</dt>
                <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
    </dl>
    </div>

Spider code is:

def parse(self, response):
        hxs = HtmlXPathSelector(response)

        practice = hxs.select('//h1/text()').extract()
        items1 = []

        results = hxs.select('//*[@id="content"]/div[@class="content"]/div/dl')
        for result in results:
            item = WebhealthItem1()
            #item['url'] = result.select('//dl/a/@href').extract()
            item['practice'] = practice
            item['hours'] = map(unicode.strip,
                result.select('dt[contains(.," Contact hours")]/following-sibling::dd[1]/text()').extract())
            item['more_hours'] = map(unicode.strip,
                result.select('dt[contains(., "More information")]/following-sibling::dd[1]/text()').extract())
            item['physical_address'] = map(unicode.strip,
                result.select('dt[contains(., "Physical address")]/following-sibling::dd[1]/text()').extract())
            item['postal_address'] = map(unicode.strip,
                result.select('dt[contains(., "Postal address")]/following-sibling::dd[1]/text()').extract())
            item['postcode'] = map(unicode.strip,
                result.select('dt[contains(., "Postcode")]/following-sibling::dd[1]/text()').extract())
            item['district_town'] = map(unicode.strip,
                result.select('dt[contains(., "District/town")]/following-sibling::dd[1]/text()').extract())
            item['region'] = map(unicode.strip,
                result.select('dt[contains(., "Region")]/following-sibling::dd[1]/text()').extract())
            item['phone'] = map(unicode.strip,
                result.select('dt[contains(., "Phone")]/following-sibling::dd[1]/text()').extract())
            item['website'] = map(unicode.strip,
                result.select('dt[contains(., "Website")]/following-sibling::dd[1]/a/@href').extract())
            item['email'] = map(unicode.strip,
                result.select('dt[contains(., "Email")]/following-sibling::dd[1]/a/text()').extract())
            items1.append(item)
        return items1

Upvotes: 0

alecxe

Reputation: 474241

The xpath you have is not correct and, besides, is very fragile.

As far as I understand, you need the numbers in bold and the text next to them. I'd rely on the td elements with battingComms class:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from crictest.items import CrictestItem


class MySpider(BaseSpider):
    name = "cricinfo"
    allowed_domains = ["espncricinfo.com/"]
    start_urls = ["http://www.espncricinfo.com/champions-league-twenty20-2014/engine/match/763595.html?innings=1;view=commentary/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//td[@class="battingComms" and b]')
        for row in rows:
            item = CrictestItem()
            item['overnum'] = row.select('b/text()').extract()[0]
            item['overnumtext'] = row.select('b/following-sibling::text()').extract()[0]
            yield item

Outputs on the console:

{'overnum': u'0.4',
 'overnumtext': u" bingo! that's a good ol slog from van Wyk right across the line of a good length ball that nips back in. No bat involved, but loads of timber. Lovely bowling from Paris and he knows it "}
{'overnum': u'1.3',
 'overnumtext': u' and dies by his reputation. Behrendorff is assisted by some swing away, Delport flings his bat at with all his might and only ends up with an outside edge that is pouched behind the wicket. Brilliant catch from Whiteman as he leaps to his left and stretches as high as he could '}
...

Upvotes: 1

Scraping website using python &amp; scrapy

Answers (2)

Related Questions

Scraping website using python & scrapy