Apple

Reputation: 55

How to get a good result from Scrapy

I am trying to scrape details from a Wikipedia page using Scrapy. The scrape itself works, but the result is messy and fragmented. Since I am new to Python and Scrapy, I am having difficulty fixing this.

Here's my code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from wikipedia.items import WikipediaItem

class WikipediaSpider(BaseSpider):
    name = "wiki"
    allowed_domains = ["wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Main_Page"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//table[@id="mp-upper"]/tr')
        items = []
        for site in sites:
            item = WikipediaItem()
            item['title'] = site.select('.//a/text()').extract()
            item['link'] = site.select('.//a/@href').extract()
            item['details'] = site.select('.//p/text()').extract()
            items.append(item)
        return items

And this is the result:

2013-04-19 02:18:48+0800 [wiki] DEBUG: Scraped from <200 http://en.wikipedia.org/wiki/Main_Page>

{'details': [u' is a fungal species found in moist habitats in ',

                 u'. The species produces brown ',
                 u' with ',

                 u' of varying shapes up to 40 millimetres (1.6\xa0in) across, and tall, thin ',

                 u' up to 62 millimetres (2.4\xa0in) long, at the base of which is a large and well-defined "bulb". The stem varies in colour, with whitish, pale yellow-brown, pale red-brown, pale brown and grey-brown all observed. The species produces unusually shaped, irregular ',

                 u', each with a few thick protrusions. This feature helps differentiate it from other species that would otherwise be similar in appearance and ',

                 u'. It grows in ',

                 u' association with ',

                 u', and it is for this that the species is named. However, particular species favoured by the fungus are unclear and may include ',

                 u' and ',

                 u' taxa. The mushrooms grow from the ground, often among mosses or ',

                 u'. The species was first described in 2009, and within the genus ',

                 u', it is a part of the ',

                 u' ',

                 u'. The ',

                 u' ',

                 u' was collected from the shore of a lake near ',

                 u', Finland. The species has also been recorded in Sweden and, at least in some areas, it is relatively common. (',

                 u')',

                 u'Recently featured: ',

                 u'\xa0\u2013 ',

                 u'\xa0\u2013 ',

                 u': ',

                 u' ',

                 u' ',

                 u'More anniversaries: ',

                 u' ',

                 u' '],

     'link': [u'/wiki/File:Inocybe_saliceticola.jpg',

              u'/wiki/Inocybe_saliceticola',

              u'/wiki/Nordic_countries',

              u'/wiki/Mushrooms',

              u'/wiki/Pileus_(mycology)',

              u'/wiki/Stipe_(mycology)',

              u'/wiki/Spore',

              u'/wiki/Habit_(biology)',

              u'/wiki/Mycorrhizal',

              u'/wiki/Willow',

              u'/wiki/Beech',

              u'/wiki/Alder',

              u'/wiki/Detritus',

              u'/wiki/Section_(botany)',

              u'/wiki/Holotype',

              u'/wiki/Nurmes',

              u'/wiki/Inocybe_saliceticola',

              u'/wiki/Thistle,_Utah',

              u'/wiki/Be_Here_Now_(album)',

              u'/wiki/Sumatran_rhinoceros',

              u'/wiki/Wikipedia:Today%27s_featured_article/April_2013',

              u'https://lists.wikimedia.org/mailman/listinfo/daily-article-l',

              u'/wiki/Wikipedia:Featured_articles',

              u'/wiki/Wikipedia:Recent_additions',

              u'/wiki/File:Ezra_Meeker_1921_crop.jpg',

              u'/wiki/Ezra_Meeker',

              u'/wiki/Oregon_Trail',

              u'/wiki/Bullock_cart',

              u'/wiki/Italy_at_the_2009_Mediterranean_Games',

              u'/wiki/2009_Mediterranean_Games_medal_table',

              u'/wiki/Cossack_hetman',

              u'/wiki/Ivan_Petrizhitsky-Kulaga',

              u'/wiki/Cossacks',

              u'/wiki/Fokus_(magazine)',

              u'/wiki/Amir_Garrett',

              u'/wiki/College_basketball',


              u'/wiki/Fastball',

              u'/wiki/Armenian_Genocide',

              u'/wiki/Karin_dialect',

              u'/wiki/Scottish_American',

              u'/wiki/Daniel_Pennie_House',

              u'/wiki/Wikipedia:Recent_additions',

              u'/wiki/Wikipedia:Your_first_article',

              u'/wiki/Template_talk:Did_you_know',

              u'/wiki/Slang',

              u'/wiki/Hammer',

              u'/wiki/Church_(building)',

              u'/wiki/Wikipedia:Today%27s_articles_for_improvement',

              u'/wiki/File:2013_Boston_Marathon_aftermath_people.jpg',

              u'/wiki/West_fertilizer_plant_explosion',

              u'/wiki/West,_Texas',

              u'/wiki/Texas',

              u'/wiki/Moment_magnitude_scale',

              u'/wiki/2013_Sistan_and_Baluchestan_earthquake',

              u'/wiki/Sistan_and_Baluchestan_Province',

              u'/wiki/15_April_2013_Iraq_attacks',

              u'/wiki/Boston_Marathon_bombings',

              u'/wiki/2013_Boston_Marathon',

              u'/wiki/Death_and_state_funeral_of_Hugo_Ch%C3%A1vez',

              u'/wiki/Nicol%C3%A1s_Maduro',

              u'/wiki/Venezuelan_presidential_election,_2013',

              u'/wiki/List_of_Presidents_of_Venezuela',

              u'/wiki/Adam_Scott_(golfer)',

              u'/wiki/2013_Masters_Tournament',

              u'/wiki/Government_of_India',

              u'/wiki/Bollywood',

              u'/wiki/Pran',

              u'/wiki/Dadasaheb_Phalke_Award',

              u'/wiki/Deaths_in_2013',

              u'/wiki/Colin_Davis',

              u'/wiki/Maria_Tallchief',

              u'/wiki/Jonathan_Winters',

              u'//en.wikinews.org/wiki/Main_Page',

              u'/wiki/Portal:Current_events',

              u'/wiki/April_18',

              u'/wiki/File:Stpetes.JPG',

              u'/wiki/1506',

              u'/wiki/St._Peter%27s_Basilica',

              u'/wiki/Vatican_City',

              u'/wiki/Old_St._Peter%27s_Basilica',

              u'/wiki/1689',

              u'/wiki/Militia_(United_States)',

              u'/wiki/Boston',

              u'/wiki/1689_Boston_revolt',

              u'/wiki/Dominion_of_New_England',

              u'/wiki/1923',

              u'/wiki/New_York_Yankees',

              u'/wiki/Major_League_Baseball',

              u'/wiki/Yankee_Stadium_(1923)',

              u'/wiki/1938',

              u'/wiki/Superman',

              u'/wiki/Jerry_Siegel',

              u'/wiki/Joe_Shuster',

              u'/wiki/Action_Comics_1',

              u'/wiki/Superhero',

              u'/wiki/Comic_book',

              u'/wiki/1947',

              u'/wiki/List_of_the_largest_artificial_non-nuclear_explosions',

              u'/wiki/Royal_Navy',

              u'/wiki/Tonne',

              u'/wiki/Ammunition',

              u'/wiki/Heligoland',

              u'/wiki/1949',

              u'/wiki/Republic_of_Ireland',

              u'/wiki/Commonwealth_of_Nations',

              u'/wiki/1996',

              u'/wiki/1996_shelling_of_Qana',

              u'/wiki/Qana',

              u'/wiki/Operation_Grapes_of_Wrath',

              u'/wiki/United_Nations_Interim_Force_in_Lebanon',

              u'/wiki/April_17',

              u'/wiki/April_18',

              u'/wiki/April_19',

              u'/wiki/Wikipedia:Selected_anniversaries/April',

              u'https://lists.wikimedia.org/mailman/listinfo/daily-article-l',

              u'/wiki/List_of_historical_anniversaries',

              u'/wiki/Coordinated_Universal_Time',

              u'//en.wikipedia.org/w/index.php?title=Main_Page&action=purge'],
     'title': [u'Inocybe saliceticola',

               u'Nordic countries',

               u'mushrooms',

               u'caps',

               u'stems',

               u'spores',

               u'habit',

               u'mycorrhizal',

               u'willow',

               u'beech',

               u'alder',

               u'detritus',

               u'section',

               u'holotype',

               u'Nurmes',

               u'Thistle, Utah',

               u'Be Here Now',

               u'Sumatran rhinoceros',

               u'Archive'

               u'List of historical anniversaries',

               u'UTC',

               u'Reload this page']}

Upvotes: 2

Views: 1167

Answers (1)

Robin

Reputation: 9644

I can't access the same page you did, but the result is probably so fragmented because Wikipedia text is full of links. When you do site.select('.//p/text()'), you select only the text nodes that are direct children of the <p> node, so whatever is inside child nodes such as <a href=..>text</a> isn't scraped. The link tags split the paragraph, so you end up with a list of disconnected fragments.
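
The split can be reproduced without Scrapy. The sketch below uses the standard library's xml.etree.ElementTree on a made-up fragment modelled on the paragraph in the question. SAMPLE, direct_text and all_text are illustrative names, not part of Scrapy, and the two helpers only approximate what the XPath expressions p/text() and p//text() would select.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, modelled on the paragraph in the question.
SAMPLE = (
    '<p><a href="/wiki/Inocybe_saliceticola">Inocybe saliceticola</a>'
    ' is a fungal species found in moist habitats in '
    '<a href="/wiki/Nordic_countries">Nordic countries</a>.</p>'
)


def direct_text(p):
    """Text nodes that are direct children of p (roughly XPath p/text())."""
    parts = [p.text] + [child.tail for child in p]
    return [t for t in parts if t]


def all_text(p):
    """Text nodes at any depth below p (roughly XPath p//text())."""
    return list(p.itertext())


p = ET.fromstring(SAMPLE)

# Only the text between the links survives, as in the question's output.
fragments = direct_text(p)

# Including the text inside <a> gives the whole sentence back.
full = "".join(all_text(p))

print(fragments)
print(full)
```

The first print shows the same kind of fragments as the 'details' list in the question; the second shows the sentence with the link text restored.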

If you want to retrieve every node, you can use:

contents = site.select('.//p/node()').extract()
item['details'] = ''.join(contents)

That way you'll have everything inside the <p> tags (including the <a> tags). If you only want the text without the link tags, you can then use strip_html(item['details']) (actually, contents = site.select('.//p//text()').extract() might work as well and is more XPath-oriented).
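
Note that strip_html is not defined anywhere above, so treat it as a helper you have to supply yourself. A minimal stand-in, assuming only the standard library's html.parser, could look like the sketch below. The class name _TextCollector and the sample string are made up for illustration, and the whitespace handling (collapsing runs of whitespace, including the non-breaking spaces that show up as \xa0 in your output) is my own choice rather than anything Scrapy does for you.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text content and ignores every tag."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &nbsp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return the text of an HTML fragment with tags removed.

    Hypothetical helper: runs of whitespace (including non-breaking
    spaces) are collapsed to a single ordinary space.
    """
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()  # flush any text still buffered by the parser
    return " ".join("".join(collector.parts).split())


# Made-up sample in the shape of ''.join(site.select('.//p/node()').extract())
details = (
    '<a href="/wiki/Inocybe_saliceticola">Inocybe saliceticola</a>'
    ' is a fungal species up to 40 millimetres (1.6&nbsp;in) across.'
)
clean = strip_html(details)
print(clean)
```

If you go the './/p//text()' route instead, you don't need a helper like this at all; joining the extracted list is enough.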

Upvotes: 2
