Reputation: 3
This is my first question ever on Stack Overflow. I started using Python to scrape data at work, and I have been using Scrapy for these tasks. I tried setting up a scraper for a government website, but it produces no output. Initially I set three rules in my rules variable, but my JSON file came up empty. The code looks fine to me, but I cannot tell what is going wrong. Thank you for any insight that you are able to share. Have a good day.
```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DirSpider(CrawlSpider):
    name = 'di7'
    allowed_domains = ['transparencia.gob.sv']
    start_urls = ['https://www.transparencia.gob.sv/categories/2']

    rules = (
        Rule(LinkExtractor(restrict_css=".filtrable a"), callback='parse_item', follow=True),
        Rule(LinkExtractor(restrict_css="a:nth-of-type(19)"), callback='parse_item', follow=True),
    )

    def parse(self, response):
        items = {}
        css_selector = response.css(".spaced .align-justify")
        for bureaucrat in css_selector:
            name = bureaucrat.css(".medium-11 a::text").extract_first()
            charge = bureaucrat.css(".medium-12::text").extract_first()
            status = bureaucrat.css(".medium-11 .text-mutted::text").extract_first()
            institution = response.css("small::text").extract()
            items['name'] = name
            items['charge'] = charge
            items['status'] = status
            items['institution'] = institution
            yield(items)
```
Upvotes: 0
Views: 139
Reputation: 10666
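Try renaming your `parse` function to `parse_item`. CrawlSpider uses `parse` internally to apply its rules, so overriding it stops the rules from running, and your rules already name `parse_item` as their callback: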
    def parse_item(self, response):
        items = {}
        css_selector = response.css(".spaced .align-justify")
        for bureaucrat in css_selector:
            name = bureaucrat.css(".medium-11 a::text").extract_first()
            charge = bureaucrat.css(".medium-12::text").extract_first()
            status = bureaucrat.css(".medium-11 .text-mutted::text").extract_first()
            institution = response.css("small::text").extract()
            items['name'] = name
            items['charge'] = charge
            items['status'] = status
            items['institution'] = institution
            yield(items)
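
One more thing that may be worth checking, separate from the rename: the callback creates items once, before the loop, so every yield hands out the same dict object. Whether that causes trouble depends on how the yielded items are consumed downstream. The sketch below uses plain Python with no Scrapy, and the ROWS data and both function names are made up for illustration. It only shows what can happen when a consumer collects the items before reading them.

```python
# Hypothetical stand-in data; in the spider these values come from CSS selectors.
ROWS = [
    {"name": "Ana", "charge": "Director"},
    {"name": "Luis", "charge": "Analyst"},
]


def shared_dict(rows):
    """Mirror the pattern above: one dict created before the loop and reused."""
    item = {}
    for row in rows:
        item["name"] = row["name"]
        item["charge"] = row["charge"]
        yield item  # the same object is yielded on every iteration


def fresh_dict(rows):
    """Build a new dict for every row, so each yielded item is independent."""
    for row in rows:
        yield {"name": row["name"], "charge": row["charge"]}


shared = list(shared_dict(ROWS))
fresh = list(fresh_dict(ROWS))

print([i["name"] for i in shared])  # ['Luis', 'Luis'] (later writes overwrote the first row)
print([i["name"] for i in fresh])   # ['Ana', 'Luis']
```

If this turns out to matter for your output, moving `items = {}` inside the for loop is the smallest change that gives each row its own dict.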
Upvotes: 1