Getting img src with Scrapy gets weird results, why?

Question

I'm trying to scrape https://celulares.mercadolibre.com.ar/ with Scrapy 1.4.0. I want a list with each product's description along with that product's img src. The problem is that when I run my spider, only the first 4 items come back correctly (description plus the corresponding img src); every remaining item has its description but an img src of None. Looking at the page source, the only difference I can see between the first 5 items and the rest is that the first ones have the class attribute "lazy-load", while the others have a special id such as "ML2178321". Since my spider code never refers to the class name, I don't understand why the behaviour changes for the later items. I suspect some jQuery/JS behaviour that I'm not aware of. Here's the code of one of the first item containers:

And here is the code of the container for one of the later images (the ones that return a None img src):

Lastly, here is the code that I'm running:

import scrapy


class MlarSpider(scrapy.Spider):
    name = "mlar"
    allowed_domains = ["mercadolibre.com.ar"]
    start_urls = ['https://celulares.mercadolibre.com.ar/']

    def parse(self, response):
        # One .results-item container per product on the listing page.
        SET_SELECTOR = '.results-item'
        for item in response.css(SET_SELECTOR):

            PRODUCTO_SELECTOR = '.item__info-title span ::text'
            IMAGEN_SELECTOR = '.image-content a img'

            yield {
                'producto': item.css(PRODUCTO_SELECTOR).extract_first(),
                'imagen': item.css(IMAGEN_SELECTOR).xpath("@src").extract_first(),
            }

        # Follow the "next" pagination link, if there is one.
        NEXT_PAGE_SELECTOR = '.pagination__next a::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )

I've implemented Barmar's suggestion and it works like a charm. I just added these lines to my spider:

        IXPATH = '@src'
        if item.css(IMAGEN_SELECTOR).xpath(IXPATH).extract_first() is None:
            IXPATH = '@data-src'
        yield {
            'producto': item.css(PRODUCTO_SELECTOR).extract_first(),
            'imagen': item.css(IMAGEN_SELECTOR).xpath(IXPATH).extract_first(),
        }

Barmar · Accepted Answer

There's no src attribute in the later images. Here's the code of that image:

The image URL is in the data-src attribute, not src.

The site is using a lazy loading plugin that waits for the user to scroll an image into the viewport before setting the src. At that point it copies the data-src attribute to src. What you posted is apparently the DOM element after this has happened, not the original HTML source, which is what Scrapy sees.
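To make the difference concrete, here is a minimal sketch that uses only Python's standard library instead of Scrapy. The two img tags in SAMPLE_HTML are made-up stand-ins, not the site's real markup, and the names ImgCollector and src_values are hypothetical. It shows that reading src from raw HTML yields None for an image whose URL lives only in data-src:

```python
from html.parser import HTMLParser

# Hypothetical markup (not copied from the site): the first image already
# has a src, the second is lazy-loaded and only carries data-src until
# JavaScript runs in a browser.
SAMPLE_HTML = """
<div class="image-content"><a href="#"><img src="https://example.com/a.jpg"></a></div>
<div class="image-content"><a href="#"><img data-src="https://example.com/b.jpg"></a></div>
"""


class ImgCollector(HTMLParser):
    """Collect the attributes of every <img> tag found in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append(dict(attrs))


def src_values(html):
    """Return the src attribute of each <img>, or None where it is absent."""
    parser = ImgCollector()
    parser.feed(html)
    parser.close()
    return [img.get("src") for img in parser.images]


print(src_values(SAMPLE_HTML))
# prints ['https://example.com/a.jpg', None]
```

A crawler that does not execute JavaScript reads the page in this raw form, which would explain the None values for the later images.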

You could simply change your script to look for data-src attributes if it can't find a src attribute.
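One possible way to write that fallback is a small helper, sketched below. The name first_image_url is hypothetical, and the helper assumes it receives whatever item.css(IMAGEN_SELECTOR) returns in the question's spider, that is, an object offering .xpath(...).extract_first():

```python
def first_image_url(img_selection):
    """Return the image URL from src, or from data-src if src is missing.

    img_selection is assumed to behave like the result of
    item.css(IMAGEN_SELECTOR) in the spider from the question.
    """
    src = img_selection.xpath('@src').extract_first()
    if src:
        return src
    # Lazy-loaded images keep their URL in data-src until a browser
    # copies it into src, so fall back to that attribute.
    return img_selection.xpath('@data-src').extract_first()
```

Inside the loop, the yielded item could then use 'imagen': first_image_url(item.css(IMAGEN_SELECTOR)). If neither attribute is present, the helper still returns None.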
