Reputation: 549
I wish to retrieve data from the script tags of multiple URLs with regex. I have a CSV file ('links.csv') that contains all the URLs I'll need to scrape. I managed to read the CSV and store all the URLs in a variable named 'start_urls'. My problem is that I need a way to read the URLs from 'start_urls' one at a time and execute the next part of my code. When I execute my code in the terminal, it returns two errors:
1. for pvi_subtype_name,pathIndicator.depth_5,model_name in zip(source): ValueError: not enough values to unpack (expected 3, got 1)
2. source = response.xpath("//script[contains(., 'COUNTRY_SHOP_STATUS')]/text()").extract()[0] IndexError: list index out of range
Here are some example URLs stored in the initial CSV ('links.csv'):
"https://www.samsung.com/uk/smartphones/galaxy-note8/"
"https://www.samsung.com/uk/smartphones/galaxy-s8/"
"https://www.samsung.com/uk/smartphones/galaxy-s9/"
Here is my code:
import scrapy
import csv
import re

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        with open('links.csv', 'r') as csvf:
            for url in csvf:
                yield scrapy.Request(url.strip())

    def parse(self, response):
        source = response.xpath("//script[contains(., 'COUNTRY_SHOP_STATUS')]/text()").extract()[0]

        def get_values(parameter, script):
            return re.findall('%s = "(.*)"' % parameter, script)[0]

        with open('baza.csv', 'w') as csvfile:
            fieldnames = ['Category', 'Type', 'SK']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            for pvi_subtype_name, pathIndicator.depth_5, model_name in zip(source):
                writer.writerow({'Category': get_values("pvi_subtype_name", source), 'Type': get_values("pathIndicator.depth_5", source), 'SK': get_values("model_name", source)})
Upvotes: 1
Views: 66
Reputation: 1285
The S9 page has a different structure from the S8 page, so there will always be an error: on the S9 page you won't find COUNTRY_SHOP_STATUS.
Using the csv writer directly is not Scrapy-like, and you overwrite your results many times, because you open a new CSV file for every product. If you really want to do it that way, open the CSV file in start_requests and append to it in parse, but have a look at item pipelines (a sketch follows at the end of this answer). I removed the loop with zip, because parse already handles one response at a time.
import scrapy
import csv
import re

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        with open('so_52069753.csv', 'r') as csvf:
            urlreader = csv.reader(csvf, delimiter=',', quotechar='"')
            for url in urlreader:
                if url[0] == "y":
                    yield scrapy.Request(url[1])
        with open('so_52069753_out.csv', 'w') as csvfile:
            fieldnames = ['Category', 'Type', 'SK']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()

    def parse(self, response):
        def get_values(parameter, script):
            return re.findall('%s = "(.*)"' % parameter, script)[0]

        source_arr = response.xpath("//script[contains(., 'COUNTRY_SHOP_STATUS')]/text()").extract()
        if source_arr:
            source = source_arr[0]
            # yield ({'Category': get_values("pvi_subtype_name", source), 'Type': get_values("pathIndicator.depth_5", source), 'SK': get_values("model_name", source)})
            with open('so_52069753_out.csv', 'a') as csvfile:
                fieldnames = ['Category', 'Type', 'SK']
                writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
                writer.writerow({'Category': get_values("pvi_subtype_name", source), 'Type': get_values("pathIndicator.depth_5", source), 'SK': get_values("model_name", source)})
I changed the input CSV file (so_52069753.csv) as well:
y,https://www.samsung.com/uk/smartphones/galaxy-note8/
y,https://www.samsung.com/uk/smartphones/galaxy-s8/
y,https://www.samsung.com/uk/smartphones/galaxy-s9/
This makes it possible to configure whether a URL is processed or not.
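As a follow-up to the item-pipeline suggestion above, here is a minimal sketch of what that could look like. The class name CsvWriterPipeline is a placeholder, not part of the original project:

import csv

class CsvWriterPipeline:
    # Runs once when the spider opens: create the output file and write the header row.
    def open_spider(self, spider):
        self.csvfile = open('so_52069753_out.csv', 'w', newline='')
        self.writer = csv.DictWriter(self.csvfile, fieldnames=['Category', 'Type', 'SK'])
        self.writer.writeheader()

    # Runs once when the spider closes: release the file handle.
    def close_spider(self, spider):
        self.csvfile.close()

    # Runs for every item the spider yields: append one row to the CSV.
    def process_item(self, item, spider):
        self.writer.writerow(dict(item))
        return item

With this enabled via ITEM_PIPELINES in settings.py (e.g. {'myproject.pipelines.CsvWriterPipeline': 300}, where the module path is an assumption), parse only needs the commented-out yield and all file handling leaves the spider. Alternatively, yielding the dict and running scrapy crawl quotes -o so_52069753_out.csv lets Scrapy's built-in feed exports write the CSV with no extra code at all.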
Upvotes: 1