Reputation: 889
I'm trying to scrape https://celulares.mercadolibre.com.ar/ with Scrapy 1.4.0. I want a list with each product's description along with that product's img src. The problem is that when I run my spider, it returns correct results (description plus the corresponding img src) for only the first 4 items; for the rest of the list I get just the description, with None as the img src. Looking at the page source, the only difference I can see between the first 5 items and the rest is that the img elements of the first ones have the class attribute "lazy-load", while the others have a special id like "ML2178321" instead. But since I don't refer to the class name anywhere in my spider code, I don't understand why the behaviour changes for these later items. I suspect some jQuery/JS mechanism that I'm not aware of. Here's the code of one of the first item containers:
<div class="image-content">
<a href="https://articulo.mercadolibre.com.ar/MLA-644049024-samsung-galaxy-j7-prime-lector-de-huella16gb3gb-ram-_JM" class="figure item-image item__js-link">
<img alt="Samsung Galaxy J7 Prime Lector De Huella+16gb+3gb Ram" src="https://http2.mlstatic.com/samsung-celulares-smartphones-D_Q_NP_771296-MLA25977210113_092017-X.jpg" class="lazy-load" srcset="https://http2.mlstatic.com/samsung-celulares-smartphones-D_Q_NP_771296-MLA25977210113_092017-X.jpg 1x, https://http2.mlstatic.com/samsung-celulares-smartphones-D_NQ_NP_771296-MLA25977210113_092017-V.jpg 2x" width="160" height="160">
</a>
</div>
And here is the code of the container of one of the later images (the ones that return None as the img src):
<div class="image-content">
<a href="https://articulo.mercadolibre.com.ar/MLA-643729195-motorola-moto-g4-4ta-gen-4g-lte-16gb-ram-2gb-libre-gtia-_JM" class="figure item-image item__js-link">
<img alt="Motorola Moto G4 4ta Gen 4g Lte 16gb Ram 2gb Libre Gtia" id="MLA643729195-I" srcset="https://http2.mlstatic.com/motorola-celulares-smartphones-D_Q_NP_765168-MLA26028117832_092017-X.jpg 1x, https://http2.mlstatic.com/motorola-celulares-smartphones-D_NQ_NP_765168-MLA26028117832_092017-V.jpg 2x" src="https://http2.mlstatic.com/motorola-celulares-smartphones-D_Q_NP_765168-MLA26028117832_092017-X.jpg" width="160" height="160">
</a>
</div>
import scrapy
import time


class MlarSpider(scrapy.Spider):
    name = "mlar"
    allowed_domains = ["mercadolibre.com.ar"]
    start_urls = ['https://celulares.mercadolibre.com.ar/']

    def parse(self, response):
        SET_SELECTOR = '.results-item'
        for item in response.css(SET_SELECTOR):
            PRODUCTO_SELECTOR = '.item__info-title span ::text'
            IMAGEN_SELECTOR = '.image-content a img'
            yield {
                'producto': item.css(PRODUCTO_SELECTOR).extract_first(),
                'imagen': item.css(IMAGEN_SELECTOR).xpath("@src").extract_first(),
            }

        NEXT_PAGE_SELECTOR = '.pagination__next a::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )
I implemented Barmar's suggestion and it works like a charm. I just added these lines to my spider:
IXPATH = '@src'
if item.css(IMAGEN_SELECTOR).xpath(IXPATH).extract_first() is None:
    IXPATH = '@data-src'
yield {
    'producto': item.css(PRODUCTO_SELECTOR).extract_first(),
    'imagen': item.css(IMAGEN_SELECTOR).xpath(IXPATH).extract_first(),
}
Upvotes: 2
Views: 2148
Reputation: 781096
There's no src attribute in the later images. Here's the code of that image:
<img width='160' height='160' alt='Motorola Moto G4 4ta Gen 4g Lte 16gb Ram 2gb Libre Gtia' id='MLA643729195-I' class='loading' title='https://http2.mlstatic.com/motorola-celulares-smartphones-D_Q_NP_765168-MLA26028117832_092017-X.webp' data-src='https://http2.mlstatic.com/motorola-celulares-smartphones-D_Q_NP_765168-MLA26028117832_092017-X.webp' data-srcset='https://http2.mlstatic.com/motorola-celulares-smartphones-D_Q_NP_765168-MLA26028117832_092017-X.webp 1x, https://http2.mlstatic.com/motorola-celulares-smartphones-D_NQ_NP_765168-MLA26028117832_092017-V.webp 2x' />
The image URL is in the data-src attribute, not src.
The site is using a lazy loading plugin that waits for the user to scroll an image into the viewport before setting the src. At that time it copies the data-src attribute to src. What you posted is apparently the DOM element after this has happened, not the original HTML source, which is what scrapy sees.
You could simply change your script to look for data-src attributes if it can't find a src attribute.
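The fallback itself is small enough to sketch without Scrapy. The example below is only an illustration, not the spider's actual code: it uses the standard library's html.parser on a made-up two-image snippet (the example.com URLs, ImgUrlParser and image_urls are hypothetical names) and takes src when present, otherwise data-src.

```python
from html.parser import HTMLParser


class ImgUrlParser(HTMLParser):
    """Collect one URL per <img>, preferring src and falling back to data-src."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # Lazy-loaded images have no src in the raw HTML, only data-src.
        self.urls.append(attributes.get("src") or attributes.get("data-src"))


def image_urls(html):
    """Return the chosen URL (or None) for every <img> in the given HTML."""
    parser = ImgUrlParser()
    parser.feed(html)
    parser.close()
    return parser.urls


# Hypothetical markup: one image already loaded, one still lazy.
RAW_HTML = """
<img alt="eager" class="lazy-load" src="https://example.com/a.jpg">
<img alt="lazy" class="loading" data-src="https://example.com/b.webp" />
"""

print(image_urls(RAW_HTML))
```

In the spider, the same idea could be written as item.css(IMAGEN_SELECTOR).xpath('@src').extract_first() or item.css(IMAGEN_SELECTOR).xpath('@data-src').extract_first(), which avoids reassigning the XPath string.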
Upvotes: 1