Reputation: 35
My output comes out as continuous data in rows rather than in proper record format (one record per row). Here's my code:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class famousPeopleItem(scrapy.Item):
    # define the fields for your item here like:
    Name = scrapy.Field()
    Profession = scrapy.Field()
    Birth_Date = scrapy.Field()
    Birth_Place = scrapy.Field()
    Nationality = scrapy.Field()
    Died_On = scrapy.Field()


# item class included here
class famousPeople(CrawlSpider):
    name = 'famous'
    start_urls = [
        'http://www.thefamouspeople.com/famous-people-by-zodiac-sign.php'
    ]
    custom_settings = {
        'DEPTH_LIMIT': '1',
    }
    rules = (
        Rule(LinkExtractor(restrict_xpaths=('//div[@class="table_list"]//a',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = famousPeopleItem()
        item["Name"] = response.xpath('//div[@class="section"]//a[2]//text()').extract()
        item["Profession"] = response.xpath('//div[@class="section"]//span//text()').extract()
        item["Birth_Date"] = response.xpath('//div[@class="section"]//p[1]//text()').extract()
        item["Birth_Place"] = response.xpath('//div[@class="section"]//p[2]//text()').extract()
        item["Nationality"] = response.xpath('//div[@class="section"]//p[3]//text()').extract()
        item["Died_On"] = response.xpath('//div[@class="section"]//p[4]//text()').extract()
        yield (item)
extract_first() does give me the data in the proper format, but then it doesn't fetch all the records.
Upvotes: 0
Views: 97
Reputation: 2286
extract() returns the scraped data as a list of (unicode) strings. If you want all the data and not only the first element, you can join the results into one string like this:
SEPARATOR = ' '
item["Name"] = SEPARATOR.join(response.xpath('//div[@class="section"]//a[2]//text()').extract())
# ... and so on
(I'm assuming here that it's OK to separate the pieces with just a space. If a different separator such as "|" or "," suits your purpose better, adjust it.)
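To make the effect of the join concrete, here is a minimal sketch in plain Python. The list `extracted` is invented sample data standing in for what `.extract()` might return, and `join_fragments` is a hypothetical helper, not part of Scrapy. It also drops whitespace-only fragments, which text nodes often contain; that step is an addition for illustration, not something the snippet above does.

```python
SEPARATOR = " | "


def join_fragments(fragments, separator=SEPARATOR):
    """Join text fragments into one string, skipping whitespace-only pieces."""
    # Strip each fragment first so stray newlines and indentation disappear.
    cleaned = (fragment.strip() for fragment in fragments)
    # Keep only fragments that still contain text after stripping.
    return separator.join(fragment for fragment in cleaned if fragment)


# Invented sample data standing in for the list returned by .extract().
extracted = ["\n  Person One ", "  ", "Person Two\n"]
joined = join_fragments(extracted)
print(joined)
```

Note that the result is still a single string covering everything the XPath matched, so the choice of separator decides how readable the field is afterwards.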
If you want to do more complex extraction operations like filtering for particular pieces, stripping etc, I suggest you have a look at Scrapy's item loaders: https://doc.scrapy.org/en/latest/topics/loaders.html
Upvotes: 0
Reputation: 2594
To get one record per row, you need to yield one item per person.
Currently you yield one (big) item whose fields collect the data for everyone, because your XPath selector spans all persons on a page.
Instead of response.xpath('//div[@class="section"]') you need a selector that spans a single person. Search the HTML for a suitable tag; the div with class "tile" looks much more promising.
Then loop over that new selector and make your item XPaths relative to the parent selector by starting them with a dot. Finally, yield one item per person.
In pseudocode it looks like this:
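def parse_item(self, response):
    # One selector per person instead of one for the whole page
    sel_persons = response.xpath('//div[@class="tile"]')
    for sel_person in sel_persons:
        # Create a fresh item for each person
        item = famousPeopleItem()
        # The leading dot keeps the XPath relative to this person's node
        item['Name'] = sel_person.xpath('.//a[2]//text()').extract_first()
        # ... (remaining fields, likewise with relative XPaths)
        yield item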
See also the Scrapy documentation, in particular the section "Working with relative XPaths".
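The loop-over-containers idea can also be tried offline. The sketch below is not Scrapy: it uses the standard library's xml.etree.ElementTree, which supports a small XPath subset, on an invented, well-formed snippet. The markup, the sample names, and the helper `parse_people` are all assumptions made for illustration; the real page's structure may differ.

```python
import xml.etree.ElementTree as ET

# Invented markup standing in for the real page (assumed structure).
PAGE = """
<html><body>
  <div class="section">
    <div class="tile"><a href="#">img</a><a href="#">Person One</a><span>Painter</span></div>
    <div class="tile"><a href="#">img</a><a href="#">Person Two</a><span>Composer</span></div>
  </div>
</body></html>
"""


def parse_people(markup):
    """Return one dict per person by querying relative to each tile."""
    root = ET.fromstring(markup.strip())
    people = []
    # Select one container node per person, not one for the whole page.
    for tile in root.findall(".//div[@class='tile']"):
        people.append({
            # Paths start at the current tile, so they only see this person.
            "Name": tile.findtext("./a[2]"),
            "Profession": tile.findtext("./span"),
        })
    return people


records = parse_people(PAGE)
for record in records:
    print(record)
```

Each dict corresponds to one row, which is the same effect as yielding one item per person inside the spider's loop.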
Upvotes: 1