Reputation: 80
import scrapy
import pandas as pd
from ..items import HomedepotpricespiderItem
from scrapy.http import Request

class HomedepotspiderSpider(scrapy.Spider):
    name = 'homeDepotSpider'
    allowed_domains = ['homedepot.com']
    start_urls = ['https://www.homedepot.com/pep/304660691']#.format(omsID = omsID)
                    #for omsID in omsList]

    def parse(self, response):
        #call home depot function
        for item in self.parseHomeDepot(response):
            yield item
        pass

    def parseHomeDepot(self, response):
        #get top level item
        items = response.css('#zone-a-product')
        for product in items:
            item = HomedepotpricespiderItem()
            #get SKU
            productSKU = product.css('.product-info-bar__detail:nth-child(2)::text').getall()
            #get rid of all the stuff i dont need
            #productSKU = [x.strip(' ') for x in productSKU] #whiteSpace
            #productSKU = [x.strip(',') for x in productSKU]
            #productSKU = [x.strip('\n') for x in productSKU]
            #productSKU = [x.strip('\t') for x in productSKU]
            #productSKU = [x.strip(' Model# ') for x in productSKU] #gets rid of the model name
So my selectors are fine and they get the correct fields.
When I run the spider with the strip lines commented out, I get 'Model #,RA30'.
When I run it with the strip lines uncommented, I get ',RA30'.
I'm running the spider with this command in the terminal: scrapy crawl homeDepotSpider -t csv -o - > "/Users/userName/Desktop/homeDepotv2Helpers/homeDepotTest.csv"
The output above is copied directly from the CSV.
Edit:
I've also tried this:
productSKU = [x.replace(' ,', '') for x in productSKU]
and that didn't work. Also, this is the direct output from the terminal: {'productSKU': ['', 'RA30']}
Upvotes: 0
Views: 139
Reputation: 28266
Your selector gives you a list of two elements: ['Model #', 'RA30'].
To get only the SKU, simply use indexing:
productSKU = product.css('.product-info-bar__detail:nth-child(2)::text').getall()[1]
If there's a chance that a product won't have an SKU, make sure to handle exceptions correctly.
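As a rough sketch of that exception handling, the indexing can be wrapped in a small helper. The function name extract_sku and its index parameter are made up for illustration; the list passed in stands for whatever .getall() returned.

```python
from typing import List, Optional


def extract_sku(texts: List[str], index: int = 1) -> Optional[str]:
    """Return the stripped text at `index`, or None if it is missing or empty.

    `texts` stands in for the list returned by `.getall()` in the spider;
    this helper is hypothetical and not part of Scrapy.
    """
    try:
        value = texts[index].strip()
    except IndexError:
        # The product page had fewer text nodes than expected.
        return None
    # Treat an empty string the same as a missing value.
    return value or None


print(extract_sku(['Model #', 'RA30']))  # RA30
print(extract_sku(['Model #']))          # None
```

In the spider this would be called as extract_sku(product.css(...).getall()), so a page without an SKU yields None instead of raising IndexError.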
Upvotes: 2
Reputation: 10666
Why don't you want to use XPath + regex?
product_model = response.xpath('//h2[@class="product-info-bar__detail"][contains(., "Model #")]/text()').re_first(r'#(.+)')
Upvotes: 1
Reputation: 1201
The strip function only removes characters at the beginning and end of a string, and it treats its argument as a set of characters to remove rather than as a substring. If you want to remove a character no matter where it appears in the string, use the replace function.
However, if you only want to remove the comma at the beginning or at the end of your string, you should repeat your line productSKU = [x.strip(',') for x in productSKU]
once more after productSKU = [x.strip(' Model# ') for x in productSKU]
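To see what strip and replace actually do to the values from the question, here is a small standalone sketch using the two strings the selector returned. The last part assumes the CSV export joins list values with a comma, which would explain where the leading comma in ',RA30' comes from; treat that as an assumption, not something confirmed in the question.

```python
# The two text nodes reported in the question.
raw = ['Model #', 'RA30']

# strip(' Model# ') removes any of the characters ' ', 'M', 'o', 'd', 'e',
# 'l', '#' from both ends, so 'Model #' is consumed entirely.
stripped = [x.strip(' Model# ') for x in raw]
print(stripped)  # ['', 'RA30']

# replace(' ,', '') changes nothing, because neither string contains ' ,'.
replaced = [x.replace(' ,', '') for x in raw]
print(replaced)  # ['Model #', 'RA30']

# Assumption: the exporter writes a list field by joining its values with ','.
# An empty first element then shows up as a leading comma.
csv_cell = ','.join(stripped)
print(csv_cell)  # ,RA30

# Dropping empty strings before export avoids the stray comma.
sku = [x for x in stripped if x]
print(sku)  # ['RA30']
```

If that assumption holds, stripping or replacing commas inside the strings cannot help, since the comma is never part of either string; filtering out the empty element (or indexing the SKU directly) is the simpler fix.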
Upvotes: 2