Priyanka

Reputation: 35

Scrapy: Unable to get the output file in proper format

I'm getting the output as continuous data in rows rather than in proper record format (one record per row). Here's my code:

import scrapy
# scrapy.contrib.* was deprecated in Scrapy 1.0; these are the current import paths
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class famousPeopleItem(scrapy.Item):
    # define the fields for your item here like:
    Name = scrapy.Field()
    Profession = scrapy.Field()
    Birth_Date = scrapy.Field()
    Birth_Place = scrapy.Field()
    Nationality = scrapy.Field()
    Died_On = scrapy.Field()


# item class included here
class famousPeople(CrawlSpider):
    name = 'famous'
    start_urls = [
        'http://www.thefamouspeople.com/famous-people-by-zodiac-sign.php'
    ]
    custom_settings = {
        'DEPTH_LIMIT': 1,
    }
    rules = (
        Rule(LinkExtractor(restrict_xpaths=('//div[@class="table_list"]//a',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = famousPeopleItem()
        item["Name"] = response.xpath('//div[@class="section"]//a[2]//text()').extract()
        item["Profession"] = response.xpath('//div[@class="section"]//span//text()').extract()
        item["Birth_Date"] = response.xpath('//div[@class="section"]//p[1]//text()').extract()
        item["Birth_Place"] = response.xpath('//div[@class="section"]//p[2]//text()').extract()
        item["Nationality"] = response.xpath('//div[@class="section"]//p[3]//text()').extract()
        item["Died_On"] = response.xpath('//div[@class="section"]//p[4]//text()').extract()
        yield item

extract_first() does give the data in the proper format, but then it doesn't fetch all the records.

Upvotes: 0

Views: 97

Answers (2)

Done Data Solutions

Reputation: 2286

extract() returns the scraped data as a list of (unicode) strings. If you want all the data and not only the first element, you can join the results into one string like this:

SEPARATOR = ' '

item["Name"] = SEPARATOR.join(response.xpath('//div[@class="section"]//a[2]//text()').extract())
# ... and so on

(I'm assuming here that it's OK to separate the pieces with just a space. If a different separator such as "|" or "," suits your purpose better, adjust SEPARATOR accordingly.)
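As a rough sketch of the join step on its own: the list below is a hypothetical stand-in for what extract() might return for one field, and join_fragments is an illustrative helper name, not part of Scrapy. It also drops whitespace-only pieces, which is an extra assumption beyond the plain join above.

```python
def join_fragments(fragments, separator=" "):
    """Join text fragments into one string, skipping whitespace-only pieces."""
    cleaned = (fragment.strip() for fragment in fragments)
    return separator.join(piece for piece in cleaned if piece)


# Hypothetical sample of what extract() might return for one field.
raw_profession = ["\n  Actor", "  ", "Director\n"]

print(join_fragments(raw_profession))       # Actor Director
print(join_fragments(raw_profession, "|"))  # Actor|Director
```

Note that this only tidies a single field; it does not split a page into one record per person.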

If you want to do more complex extraction operations like filtering for particular pieces, stripping etc, I suggest you have a look at Scrapy's item loaders: https://doc.scrapy.org/en/latest/topics/loaders.html

Upvotes: 0

Frank Martin

Reputation: 2594

To get one record per row, you need to yield one item per person.

Currently you yield one (big) item per page, with the data of every person collected into each field. This is because your XPath expressions match all persons on the page.

Instead of response.xpath('//div[@class="section"]'), you need a selector that matches a single person. Search the HTML for a suitable element; div[@class="tile"] looks much more promising.

Then you should loop over that new selector and make your item XPaths relative to the parent selector by starting with a dot. Finally yield one item per person.

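Pseudocode looks like this:

def parse_item(self, response):
    # one selector per person instead of one for the whole page
    sel_persons = response.xpath('//div[@class="tile"]')
    for sel_person in sel_persons:
        item = famousPeopleItem()  # fresh item for each person
        # ...
        # the leading dot makes the XPath relative to sel_person
        item['Name'] = sel_person.xpath('.//a[2]//text()').extract_first()
        # ...
        yield item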

See also the Scrapy documentation, in particular the section "Working with relative XPaths".
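To see the difference between a page-wide query and a per-person query without running a crawl, here is a small sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors (it supports only a small XPath subset), and the markup and the "Person A" style values are invented for illustration; the real page will differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; not copied from the real site.
PAGE = """
<div class="section">
  <div class="tile"><a>img</a><a>Person A</a><span>Profession A</span></div>
  <div class="tile"><a>img</a><a>Person B</a><span>Profession B</span></div>
</div>
"""


def page_wide(root):
    """One big record: each field collects the values of all persons."""
    return {
        "Name": [a.text for a in root.findall(".//a[2]")],
        "Profession": [s.text for s in root.findall(".//span")],
    }


def per_person(root):
    """One record per tile: every query is relative to that tile."""
    records = []
    for tile in root.findall(".//div[@class='tile']"):
        records.append({
            "Name": tile.findtext("a[2]"),
            "Profession": tile.findtext("span"),
        })
    return records


root = ET.fromstring(PAGE)
print(page_wide(root))   # one dict holding lists for the whole page
print(per_person(root))  # a list with one dict per person
```

In Scrapy the loop has the same shape: iterate over the per-person selectors and query each one with a relative XPath such as './/a[2]//text()'.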

Upvotes: 1
