Scrapy: Unable to get the output file in proper format

Question

I'm getting the output as continuous data run together in a row rather than in proper record format (one record per row). Here's my code:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class famousPeopleItem(scrapy.Item):
    # define the fields for your item here like:
    Name = scrapy.Field()
    Profession = scrapy.Field()
    Birth_Date = scrapy.Field()
    Birth_Place = scrapy.Field()
    Nationality = scrapy.Field()
    Died_On = scrapy.Field()


# item class included here
class famousPeople(CrawlSpider):
    name = 'famous'
    start_urls = [
        'http://www.thefamouspeople.com/famous-people-by-zodiac-sign.php'
    ]
    custom_settings = {
        'DEPTH_LIMIT': '1',
    }
    rules = (
        Rule(LinkExtractor(restrict_xpaths=('//div[@class="table_list"]//a',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = famousPeopleItem()
        item["Name"] = response.xpath('//div[@class="section"]//a[2]//text()').extract()
        item["Profession"] = response.xpath('//div[@class="section"]//span//text()').extract()
        item["Birth_Date"] = response.xpath('//div[@class="section"]//p[1]//text()').extract()
        item["Birth_Place"] = response.xpath('//div[@class="section"]//p[2]//text()').extract()
        item["Nationality"] = response.xpath('//div[@class="section"]//p[3]//text()').extract()
        item["Died_On"] = response.xpath('//div[@class="section"]//p[4]//text()').extract()
        yield item

Using extract_first() does give the data in the proper format, but then it doesn't fetch all the records.

Frank Martin · Accepted Answer

To get one record per row, you need to yield one item per person.

Currently you yield one (big) item where all data is fetched into your fields. This is because your XPath selector spans all persons on a page.
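
To see the effect in isolation, here is a minimal sketch that uses Python's standard-library xml.etree.ElementTree as a stand-in for Scrapy selectors, so it runs without Scrapy installed. The markup, names, and years are made up for illustration and are not the real page structure; ElementTree also supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: two people, each in its own "tile" div.
PAGE = """
<div class="section">
  <div class="tile"><a>Ada Lovelace</a><p>1815</p></div>
  <div class="tile"><a>Alan Turing</a><p>1912</p></div>
</div>
"""

root = ET.fromstring(PAGE)

# Page-wide queries: each returns one list covering every person on the page.
names = [a.text for a in root.findall(".//a")]
years = [p.text for p in root.findall(".//p")]

# Storing the whole lists in a single item gives one "row" holding all people
# (the analogue of calling extract() on a page-wide selector).
big_item = {"Name": names, "Birth_Date": years}

# Taking only the first match (the analogue of extract_first()) gives one
# clean record, but every other person on the page is dropped.
first_only = {"Name": names[0], "Birth_Date": years[0]}

print(big_item)
print(first_only)
```

Here big_item holds both people inside one record, while first_only is well-formed but contains only the first person, which matches the two symptoms described in the question.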

Instead of response.xpath('//div[@class="section"]'), you need a selector that matches a single person. Search the HTML for a suitable element. The div elements with class "tile" look much more promising.

Then loop over that new selector and make your item XPaths relative to the parent selector by starting them with a dot. Without the dot, an XPath beginning with // searches the whole document again, even when it is called on a sub-selector. Finally, yield one item per person.

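A sketch of the idea (only Name is shown; the other fields follow the same pattern):

def parse_item(self, response):
    # One selector per person instead of one for the whole page
    sel_persons = response.xpath('//div[@class="tile"]')
    for sel_person in sel_persons:
        # Create a fresh item for each person
        item = famousPeopleItem()
        # The leading dot makes the XPath relative to sel_person
        item['Name'] = sel_person.xpath('.//a[2]//text()').extract_first()
        # ... fill the remaining fields the same way ...
        yield item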

See also the Scrapy documentation, in particular the section "Working with relative XPaths".
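
The loop-per-person pattern can also be tried outside a spider. The following is a rough sketch only: it swaps Scrapy selectors for the standard-library xml.etree.ElementTree (which implements a limited XPath subset), and the sample markup, the parse_people helper, and the field values are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the third person has no birth year on the page.
PAGE = """
<div class="section">
  <div class="tile"><a>Ada Lovelace</a><p>1815</p></div>
  <div class="tile"><a>Alan Turing</a><p>1912</p></div>
  <div class="tile"><a>Grace Hopper</a></div>
</div>
"""


def parse_people(page):
    """Yield one record per person found in the page markup."""
    root = ET.fromstring(page)
    # Select one container element per person.
    for tile in root.findall(".//div[@class='tile']"):
        # Queries start with a dot and are evaluated relative to this tile,
        # so each record only sees its own person's data.
        yield {
            "Name": tile.findtext(".//a"),
            # findtext returns None when the tile has no matching element.
            "Birth_Date": tile.findtext(".//p"),
        }


records = list(parse_people(PAGE))
for record in records:
    print(record)
```

Each tile produces its own record, and a missing field shows up as None in that record alone instead of shifting the values of the people after it, which is what tends to happen when page-wide lists are lined up by position.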
