Reputation: 19
I want to pull the part links under the tr-mfgPartNumber class from this HTML, but I get no output.
At first I thought the problem was my syntax for selecting each class, but after changing it there is still no output.
I then tried adding another for loop that starts from the whole tbody. If anyone can check whether the way I'm selecting the classes is wrong, that would be great!
import scrapy

class DigiSpider(scrapy.Spider):
    name = 'digi'
    allowed_domains = ['digikey.com']
    start_urls = ['https://www.digikey.com/products/en/integrated-circuits-ics/memory/774?FV=-1%7C428%2C-8%7C774%2C7%7C1/']

    def parse(self, response):
        data={}
        parts=response.css('tbody.InkPart')
        for part in parts:
            for p in part.css('td.tr-mfgPartNumber'):
                data['href'] = p.css('a::attr(href)').extract()
                yield data
Below is the HTML:
<tbody id="lnkPart" cookie-tracking="ref_page_event=Select Part;available_parameters=["s","pv1989","pv142","pv2042","pv2192","pv276","pv252","pv16","pv1291"];">
  <tr>
    <td class="tr-compareParts" align="center">
      <input type="checkbox" name="part" value="428-3574-2-ND" id="428-3574-2-ND" onclick="partClick();">
      <label title="Compare Parts" for="428-3574-2-ND"></label>
    </td>
    <td class="tr-datasheet">
      <a class="lnkDatasheet" href="https://www.cypress.com/file/43021/download" target="_blank" track-data="ref_page_event=Display Asset;page_title=Datasheet;asset_type=Datasheet">
        <img class="datasheet-img" src="//www.digikey.com/Web%20Export/Common/icons/datasheet.png" alt="CY62157EV30LL-45ZSXIT Datasheet" title="CY62157EV30LL-45ZSXIT Datasheet">
      </a>
    </td>
    <td class="tr-image">
      <a href="/product-detail/en/cypress-semiconductor-corp/CY62157EV30LL-45ZSXIT/428-3574-2-ND/1205268">
        <img class="pszoomer" zoomimg="//media.digikey.com/Renders/Cypress%20Semi%20Renders/428;51-85087;Z,ZS;44.jpg" border="0" height="64" src="//media.digikey.com/Renders/Cypress%20Semi%20Renders/428;51-85087;Z,ZS;44_tmb.jpg" alt="CY62157EV30LL-45ZSXIT - Cypress Semiconductor Corp" title="CY62157EV30LL-45ZSXIT - Cypress Semiconductor Corp">
      </a>
    </td>
    <td class="tr-dkPartNumber nowrap-culture">
      <a href="/product-detail/en/cypress-semiconductor-corp/CY62157EV30LL-45ZSXIT/428-3574-2-ND/1205268">
        428-3574-2-ND
      </a>
      <div class="product-indicator-collection">
        <a class="align-indicator-collection" href="javascript:msgBox('#dlgRohs');">
          <img class="rohs-foilage" src="//www.digikey.com/web%20export/common/mkt/en/leaf.png" border="0" alt="This part is RoHS compliant." title="This part is RoHS compliant.">
        </a>
      </div>
    </td>
    <td class="tr-mfgPartNumber">
      <a href="/product-detail/en/cypress-semiconductor-corp/CY62157EV30LL-45ZSXIT/428-3574-2-ND/1205268">
        <span>CY62157EV30LL-45ZSXIT</span>
      </a>
    </td>
Upvotes: 0
Views: 76
Reputation: 50
When I ran the same code, Scrapy got an empty response. The site was probably detecting and blocking the spider; after I set a browser user agent, it worked.
Here's the code (I also changed "tbody.InkPart" to "tbody#lnkPart": lnkPart is the id of the tbody, not a class, and it starts with a lowercase L rather than a capital I. The tbody selector is not strictly needed, since there's only one tbody tag):
import scrapy

class DigiSpider(scrapy.Spider):
    name = 'digi'
    allowed_domains = ['digikey.com']
    # Send a browser-like user agent; without it the response came back empty.
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"
    }
    start_urls = ['https://www.digikey.com/products/en/integrated-circuits-ics/memory/774?FV=-1%7C428%2C-8%7C774%2C7%7C1/']

    def parse(self, response):
        # lnkPart is an id, so it is selected with '#', not '.'.
        parts = response.css('tbody#lnkPart')
        for part in parts:
            for p in part.css('td.tr-mfgPartNumber'):
                # Yield a fresh dict per row so items do not share state.
                yield {'href': p.css('a::attr(href)').extract()}
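To see why the class selector matched nothing, here is a minimal offline sketch. It does not use Scrapy: it swaps in the standard library's ElementTree with XPath attribute tests as a rough stand-in for the two CSS selectors, so it runs without network access. The SAMPLE markup is a trimmed, well-formed excerpt of the HTML from the question (the wrapping table tag and the closing tags are my additions), and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed excerpt of the question's table; <table> and the closing tags
# were added here so the snippet parses as well-formed XML.
SAMPLE = """
<table>
  <tbody id="lnkPart">
    <tr>
      <td class="tr-dkPartNumber nowrap-culture">
        <a href="/product-detail/en/cypress-semiconductor-corp/CY62157EV30LL-45ZSXIT/428-3574-2-ND/1205268">428-3574-2-ND</a>
      </td>
      <td class="tr-mfgPartNumber">
        <a href="/product-detail/en/cypress-semiconductor-corp/CY62157EV30LL-45ZSXIT/428-3574-2-ND/1205268"><span>CY62157EV30LL-45ZSXIT</span></a>
      </td>
    </tr>
  </tbody>
</table>
"""

root = ET.fromstring(SAMPLE)

# Rough equivalent of 'tbody.InkPart': a tbody whose class attribute is "InkPart".
# The tbody has no class attribute at all, so nothing matches.
by_class = root.findall(".//tbody[@class='InkPart']")

# Rough equivalent of 'tbody#lnkPart': a tbody whose id attribute is "lnkPart".
by_id = root.findall(".//tbody[@id='lnkPart']")

# Collect the link target from the manufacturer part number cell only.
# Note: [@class='...'] is an exact string match, unlike a CSS class selector,
# which matches one token of a space-separated class list.
hrefs = [
    a.get("href")
    for tbody in by_id
    for a in tbody.findall(".//td[@class='tr-mfgPartNumber']/a")
]

print(len(by_class), len(by_id), hrefs)
```

With an empty outer match, the for loops in the original spider never run, which is consistent with the spider producing no items even once the page itself is fetched correctly.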
Upvotes: 1