Alejandro Lorefice

Reputation: 889

Getting img src with Scrapy gets weird results, why?

I'm trying to scrape https://celulares.mercadolibre.com.ar/ with Scrapy 1.4.0. What I want is a list with the description of each product along with the img src of that product. The problem is that when I run my spider, it returns correct results (description plus the corresponding img src) for only the first 4 items; for the rest of the item list I get just the description, with None as the img src. Looking at the page's source code, the only difference I can see between the first 5 items and the rest is that the first ones have a class attribute of "lazy-load", while the others have a special id like "ML2178321". But since I don't refer to the class name in the spider code, I don't understand why the behaviour changes for these later items. I suspect some jQuery/JS behaviour that I'm not aware of. Here's the code of one of the first item containers:

<div class="image-content">

 <a href="https://articulo.mercadolibre.com.ar/MLA-644049024-samsung-galaxy-j7-prime-lector-de-huella16gb3gb-ram-_JM" class="figure item-image item__js-link"> 
 
 <img alt="Samsung Galaxy J7 Prime Lector De Huella+16gb+3gb Ram" src="https://http2.mlstatic.com/samsung-celulares-smartphones-D_Q_NP_771296-MLA25977210113_092017-X.jpg" class="lazy-load" srcset="https://http2.mlstatic.com/samsung-celulares-smartphones-D_Q_NP_771296-MLA25977210113_092017-X.jpg 1x, https://http2.mlstatic.com/samsung-celulares-smartphones-D_NQ_NP_771296-MLA25977210113_092017-V.jpg 2x" width="160" height="160"> 
 
 </a> 

</div>

And here is the code of the container from one of the later images (the ones that return None as the img src):

 <div class="image-content">
 
 <a href="https://articulo.mercadolibre.com.ar/MLA-643729195-motorola-moto-g4-4ta-gen-4g-lte-16gb-ram-2gb-libre-gtia-_JM" class="figure item-image item__js-link"> 
 
 <img alt="Motorola Moto G4 4ta Gen 4g Lte 16gb Ram 2gb Libre Gtia" id="MLA643729195-I" srcset="https://http2.mlstatic.com/motorola-celulares-smartphones-D_Q_NP_765168-MLA26028117832_092017-X.jpg 1x, https://http2.mlstatic.com/motorola-celulares-smartphones-D_NQ_NP_765168-MLA26028117832_092017-V.jpg 2x" src="https://http2.mlstatic.com/motorola-celulares-smartphones-D_Q_NP_765168-MLA26028117832_092017-X.jpg" width="160" height="160"> 
 
 </a> 
 
 </div>
Lastly, here is the code that I'm running:

import scrapy
import time

class MlarSpider(scrapy.Spider):
    name = "mlar"
    allowed_domains = ["mercadolibre.com.ar"]
    start_urls = ['https://celulares.mercadolibre.com.ar/']

    def parse(self, response):
        SET_SELECTOR = '.results-item'
        for item in response.css(SET_SELECTOR):

            PRODUCTO_SELECTOR = '.item__info-title span ::text'
            IMAGEN_SELECTOR = '.image-content a img'

            yield {
                'producto': item.css(PRODUCTO_SELECTOR).extract_first(),
                'imagen': item.css(IMAGEN_SELECTOR).xpath("@src").extract_first(),
            }

        NEXT_PAGE_SELECTOR = '.pagination__next a::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )

I've implemented Barmar's suggestion and it works like a charm. I just added these lines to my spider:

        # Lazy-loaded images keep their URL in data-src, so fall back to it when src is missing
        IXPATH = '@src'
        if item.css(IMAGEN_SELECTOR).xpath(IXPATH).extract_first() is None:
            IXPATH = '@data-src'
        yield {
            'producto': item.css(PRODUCTO_SELECTOR).extract_first(),
            'imagen': item.css(IMAGEN_SELECTOR).xpath(IXPATH).extract_first(),
        }

Upvotes: 2

Views: 2148

Answers (1)

Barmar

Reputation: 781096

There's no src attribute in the later images. Here's the actual source of that image:

<img width='160' height='160' alt='Motorola Moto G4 4ta Gen 4g Lte 16gb Ram 2gb Libre Gtia' id='MLA643729195-I' class='loading' title='https://http2.mlstatic.com/motorola-celulares-smartphones-D_Q_NP_765168-MLA26028117832_092017-X.webp' data-src='https://http2.mlstatic.com/motorola-celulares-smartphones-D_Q_NP_765168-MLA26028117832_092017-X.webp' data-srcset='https://http2.mlstatic.com/motorola-celulares-smartphones-D_Q_NP_765168-MLA26028117832_092017-X.webp 1x, https://http2.mlstatic.com/motorola-celulares-smartphones-D_NQ_NP_765168-MLA26028117832_092017-V.webp 2x' />

The image URL is in the data-src attribute, not src.

The site is using a lazy-loading plugin that waits for the user to scroll an image into the viewport before setting the src. At that point it copies the data-src attribute to src. What you posted is apparently the DOM element after this has happened, not the original HTML source, which is what Scrapy sees.
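
One way to confirm this is to look at the attributes in the raw markup rather than in the browser's inspector. The following is only a sketch using Python's standard html.parser, so it runs without Scrapy; the ImgAttrCollector class, the collect_img_attrs function, and the shortened example.com markup are made up for illustration and are not taken from the site.

```python
from html.parser import HTMLParser


class ImgAttrCollector(HTMLParser):
    """Collect the attributes of every <img> tag found in raw HTML."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />
        if tag == "img":
            self.images.append(dict(attrs))


def collect_img_attrs(raw_html):
    """Return one attribute dict per <img> tag, in document order."""
    parser = ImgAttrCollector()
    parser.feed(raw_html)
    parser.close()
    return parser.images


# Hypothetical, shortened markup: one eagerly loaded image and one lazy one
RAW_HTML = """
<img class="lazy-load" src="https://example.com/first.jpg" width="160">
<img class='loading' data-src='https://example.com/later.webp' width='160' />
"""

for attrs in collect_img_attrs(RAW_HTML):
    # Report which of the two URL-carrying attributes each image has
    print("has src:", "src" in attrs, "| has data-src:", "data-src" in attrs)
```

Inside scrapy shell, printing response.css('img').extract() gives the same kind of view of what the spider actually receives.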

You could simply change your script to look for data-src attributes if it can't find a src attribute.
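
As a rough sketch of that fallback, assuming src and data-src are the only two attributes that matter: the helper below (pick_image_url is a hypothetical name, not part of Scrapy) takes a plain dict of attributes so it can be tried without running a spider.

```python
def pick_image_url(attrs, names=("src", "data-src")):
    """Return the first non-empty value among the given attribute names.

    `attrs` is any dict-like mapping of attribute name to value.
    Returns None when none of the attributes is present or non-empty.
    """
    for name in names:
        value = attrs.get(name)
        if value:
            return value
    return None


# Hypothetical attribute dicts standing in for the two kinds of <img> tags
eager = {"src": "https://example.com/first.jpg", "class": "lazy-load"}
lazy = {"data-src": "https://example.com/later.webp", "class": "loading"}

print(pick_image_url(eager))
print(pick_image_url(lazy))
```

In the spider, the same idea can be written inline as img.xpath('@src').extract_first() or img.xpath('@data-src').extract_first(), where img is item.css(IMAGEN_SELECTOR).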

Upvotes: 1
