Basically, I am putting the data I extracted into a CSV file, but there are some problems with the format:
- First, only the parts get displayed; nothing else is displayed, e.g. Quantity and Price.
- Secondly, the column headers seem to be repeating down the rows.
I would like the parts, prices, and quantity to be displayed down different columns, with the headers being the names. If anyone could just tell me where I can learn how to do this, that would help a lot!
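For reference, what I'm after is a CSV shaped like this, with a single header row and each field in its own column (the part numbers and values here are just placeholders):

Part,Quantity,Price
CY62128ELL-45SXI,1,2.05
CY62148EV30LL-45ZSXI,1,3.12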
import scrapy

class DigiSpider(scrapy.Spider):
    name = 'digi'
    allowed_domains = ['digikey.com']
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"
    }
    start_urls = ['https://www.digikey.com/products/en/integrated-circuits-ics/memory/774?FV=-1%7C428%2C-8%7C774%2C7%7C1&quantity=0&ColumnSort=0&page=1&k=cy621&pageSize=500&pkeyword=cy621']

    def parse(self, response):
        parts = response.css('table#productTable.productTable')
        for part in parts:
            for p in part.css('tbody#lnkPart'):
                yield {
                    'Part': p.css('td.tr-mfgPartNumber span::text').extract(),
                    'Quantity': p.css('td.tr-minQty.ptable-param span.desktop::text').extract(),
                    'Price': p.css('td.tr-unitPrice.ptable-param span::text').extract()
                }
settings.py
BOT_NAME = 'website1'
SPIDER_MODULES = ['website1.spiders']
NEWSPIDER_MODULE = 'website1.spiders'
# Export as CSV feed
# FEED_EXPORT_FIELDS = ["parts", "quantity", "price"]
FEED_FORMAT = "csv"
FEED_URI = "parts.csv"
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'website1 (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
Are you getting the correct data when you test in the Scrapy shell? It's worth trying out your selectors in the Scrapy shell before committing them to a script.
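For example, from the project directory a quick session might look something like this (output depends on the live page; <start_url> stands in for your full Digi-Key URL):

scrapy shell '<start_url>'
>>> response.css('tbody#lnkPart > tr')                     # should return a list of row selectors
>>> response.css('td.tr-mfgPartNumber span::text').get()   # first part number as a string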
I've not looked in detail at your CSS selectors, but there are a lot of for loops when essentially all you need to do is loop over the tr elements. Finding a CSS selector that gets you all the rows, instead of looping over the whole table and working your way down, is probably more efficient.
Update:
Since you asked about the for loop:
for p in response.css('tbody#lnkPart > tr'):
    yield {
        'Part': p.css('td.tr-mfgPartNumber span::text').get(),
        'Quantity': p.css('td.tr-minQty.ptable-param span.desktop::text').get(),
        'Price': p.css('td.tr-unitPrice.ptable-param span::text').get()
    }
Note that we only need to loop over the tr elements; this selector picks up all of them. The get() method then returns only the first match inside that specific tr, as a single string rather than a list.
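To make that concrete, here is the difference on a single row (the part number shown is just a placeholder):

p.css('td.tr-mfgPartNumber span::text').get()      # 'CY62128ELL-45SXI' — first match as a string, or None
p.css('td.tr-mfgPartNumber span::text').extract()  # ['CY62128ELL-45SXI'] — always a list of all matches

Yielding lists from extract() is why the original CSV cells didn't come out as plain values.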
Note that you'll need to think about how you'll handle whitespace and None items. It's worth thinking this part over carefully and coming up with a simple way to clean up the results.
Updated Code
def parse(self, response):
    for p in response.css('tbody#lnkPart > tr'):
        if p.css('td.tr-minQty.ptable-param span.desktop::text').get():
            quantity = p.css('td.tr-minQty.ptable-param span.desktop::text').get()
            quantity = quantity.strip()
            # Strip thousands separators before converting to an int.
            cleaned_quantity = int(quantity.replace(',', ''))
        else:
            # Assign the fallback to the variable that is actually yielded below.
            cleaned_quantity = 'No quantity'
        if p.css('td.tr-unitPrice.ptable-param span::text').get():
            price = p.css('td.tr-unitPrice.ptable-param span::text').get()
            cleaned_price = price.strip()
        else:
            cleaned_price = 'No Price'
        yield {
            'Part': p.css('td.tr-mfgPartNumber span::text').get(),
            'Quantity': cleaned_quantity,
            'Price': cleaned_price
        }
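If you want to avoid repeating the if/else blocks, one option is to pull the cleanup into a small module-level helper (an untested sketch — clean() is just a name I'm using here, and the fallback strings are illustrative):

def clean(row, css, default):
    # Return the first stripped match for the selector, or a default if absent.
    value = row.css(css).get()
    return value.strip() if value else default

def parse(self, response):
    for p in response.css('tbody#lnkPart > tr'):
        yield {
            'Part': clean(p, 'td.tr-mfgPartNumber span::text', 'No part'),
            'Quantity': clean(p, 'td.tr-minQty.ptable-param span.desktop::text', 'No quantity'),
            'Price': clean(p, 'td.tr-unitPrice.ptable-param span::text', 'No Price'),
        }

You'd still convert Quantity to an int (after removing the commas) if you need a number rather than a string.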
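One last thing about your settings: if you uncomment FEED_EXPORT_FIELDS, the names must exactly match the keys you yield ('Part', 'Quantity', 'Price'), otherwise those columns come out empty — the lowercase "parts"/"quantity"/"price" in your file won't line up. Something like:

FEED_EXPORT_FIELDS = ["Part", "Quantity", "Price"]

also fixes the column order in the CSV.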