Reputation: 9
I'm scraping a website, looking for paragraphs in a specific place across a large number of URLs. What I would like to do is record the URL I scraped next to the scraped paragraph in a CSV file, for each URL I visit.
First I build a list of all the pages I want to scrape using the site's search syntax, searching for books by ISBN. What I'm currently yielding is a list of scraped paragraphs, just as I wanted. However, the scrape occasionally fails for a page, so I can't simply concatenate the scraped paragraphs with my list of ISBNs after the fact, because the two lists don't line up.
I tried putting some code inside the yielded dictionary, to no avail. Any ideas, or other Scrapy suggestions?
import scrapy
from scrapy.crawler import CrawlerProcess

starts = []
for isbn in data:
    starts.append('https://www.********.com/search?q=' + isbn)

class ESSpider(scrapy.Spider):
    name = "ESS"
    start_urls = starts

    def parse(self, response):
        for article in response.xpath('//html'):
            yield {
                'text': article.xpath('body/div[@class="content"]/div[@class="mainContentContainer "]/div[@class="mainContent "]/div[@class="mainContentFloat "]/div[@class="leftContainer"]/div[@id="topcol"]/div[@id="metacol"]/div[@id="descriptionContainer"]//span/text()').extract(),
            }

process = CrawlerProcess({
    'FEED_FORMAT': 'csv',
    'FEED_URI': 'blurbs2.csv',
    'LOG_ENABLED': False,
    'ROBOTSTXT_OBEY': True,
    'USER_AGENT': ********,
    'AUTOTHROTTLE_ENABLED': True,
    'HTTPCACHE_ENABLED': True,
    'DOWNLOAD_DELAY': 1,
})
process.crawl(ESSpider)
process.start()
Upvotes: 0
Views: 133
Reputation: 10666
If you want to record the URL of each response, yield it alongside the text using `response.url`:
def parse(self, response):
    for article in response.xpath('//html'):
        item = {
            'text': article.xpath('body/div[@class="content"]/div[@class="mainContentContainer "]/div[@class="mainContent "]/div[@class="mainContentFloat "]/div[@class="leftContainer"]/div[@id="topcol"]/div[@id="metacol"]/div[@id="descriptionContainer"]//span/text()').extract(),
            'url': response.url,
        }
        yield item
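Because the CSV feed exporter writes one row per yielded item, the text and its source URL stay paired even when some pages yield nothing (those pages simply produce no row). A minimal standard-library sketch of the file this produces, using made-up item values in place of real scraped data:

```python
import csv
import io

# Hypothetical items, shaped like what the spider above would yield:
# each dict becomes one CSV row, keeping text and url side by side.
items = [
    {'text': 'First blurb', 'url': 'https://example.com/search?q=9780000000001'},
    {'text': 'Second blurb', 'url': 'https://example.com/search?q=9780000000002'},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['text', 'url'])
writer.writeheader()
writer.writerows(items)
print(buf.getvalue())
```

A page that fails to match the XPath just contributes no item, so the `url` column tells you exactly which ISBNs are missing afterwards.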
Upvotes: 1