Reputation:
I would like to crawl a set of web pages using Scrapy. However, when I export the scraped items to a JSON file, some of the fields come back empty.
Here is my code:
import scrapy

class LLPubs(scrapy.Spider):
    name = "linlinks"
    start_urls = [
        'http://www.linnaeuslink.org/records/record/1',
        'http://www.linnaeuslink.org/records/record/2',
    ]

    def parse(self, response):
        for container in response.css('div.item'):
            yield {
                'text': container.css('div.field.soulsbyNo .value span::text').extract(),
                'uniformtitle': container.css('div.field.uniformTitle .value span::text').extract(),
                'title': container.css('div.field.title .value span::text').extract(),
                'opac': container.css('div.field.localControlNo .value span::text').extract(),
                'url': container.css('div#digitalLinks li a').extract(),
                'partner': container.css('div.logoContainer img:first-child').xpath('@src').extract(),
            }
And an example of my output:
{
    "text": ["Soulsby no. 46(1)"],
    "uniformtitle": ["Systema naturae"],
    "title": ["Caroli Linn\u00e6i ... Systema natur\u00e6\nin quo natur\u00e6 regna tria, secundum classes, ordines, genera, species, systematice proponuntur."],
    "opac": ["002178079"],
    "url": [],
    "partner": []
},
I am hoping I am doing something silly and easy to fix! Both of the selectors I am using for "url" and "partner" were working when I tested them from here:
scrapy shell 'http://www.linnaeuslink.org/records/record/1'
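Inside the shell, the checks looked something like this (output elided, but both lists were non-empty):

>>> response.css('div#digitalLinks li a').extract()
[...]  # a non-empty list of <a> elements
>>> response.css('div.logoContainer img:first-child').xpath('@src').extract()
[...]  # a non-empty list of image paths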
So, I just don't know what I am missing.
Oh, and I am exporting to JSON using this command for now:
scrapy crawl linlinks -o quotes.json
Thanks for your help!
Upvotes: 0
Views: 640
Reputation: 1548
The problem seems to be that those two selectors are not "findable" inside any div.item. You probably validated them in the shell against the whole page, without first narrowing the scope with response.css('div.item'). To replicate what you used in the shell, just replace container.css with response.css for the url and partner keys.
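In other words, the spider would look something like this (a sketch assuming, as the empty output suggests, that div#digitalLinks and the logo container sit outside div.item; only the last two selectors change):

import scrapy

class LLPubs(scrapy.Spider):
    name = "linlinks"
    start_urls = [
        'http://www.linnaeuslink.org/records/record/1',
        'http://www.linnaeuslink.org/records/record/2',
    ]

    def parse(self, response):
        for container in response.css('div.item'):
            yield {
                'text': container.css('div.field.soulsbyNo .value span::text').extract(),
                'uniformtitle': container.css('div.field.uniformTitle .value span::text').extract(),
                'title': container.css('div.field.title .value span::text').extract(),
                'opac': container.css('div.field.localControlNo .value span::text').extract(),
                # These two elements apparently live outside div.item, so
                # search the whole page, exactly as the shell session did:
                'url': response.css('div#digitalLinks li a').extract(),
                'partner': response.css('div.logoContainer img:first-child').xpath('@src').extract(),
            }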
Upvotes: 1