Naiky

Reputation: 23

I need help scraping an aspx site

I am currently trying to scrape the main product information (name, price, and image URL) from the different categories of a supermarket, but I am struggling with the page: it seems I can't access a category URL directly, it always redirects me to the home page.

The page I'm trying to scrape is: https://www.veadigital.com.ar/ (this is the main page). But I would like to access the different subcategory pages of the 'Bebidas' category. The URL of a subcategory looks like this: https://www.veadigital.com.ar/Comprar/Home.aspx#_atCategory=false&_atGrilla=true&_id=141446

Only the id changes, but when I run my spider on a subcategory URL, I get the main page as the response. Sorry if I am not being clear enough; any help would be much appreciated.

Here is my spider:

from scrapy.spiders import Spider
from scrapy.http import Request
from scrapy.selector import Selector
from ..items import ProductoGenericoItem


# Spider (not CrawlSpider) is the right base class here: CrawlSpider is only
# needed when using link-extraction rules, and overriding parse() on it is
# explicitly discouraged by Scrapy.
class VeaSpider(Spider):
    name = "vea"

    pos = 1
    # NOTE: everything after '#' is a URL fragment; browsers (and Scrapy)
    # never send it to the server, so the server only ever sees
    # /Comprar/Home.aspx regardless of the id.
    base_url = "https://www.veadigital.com.ar/Comprar/Home.aspx#_atCategory=false&_atGrilla=true&_id={0}"
    c = 0

    cat = [
        141446, # a base de hierbas
        446126, # aguas sin gas
        446127, # aguas con gas
        446128, # aguas saborizadas
        141231, # aperitivos
        141236, # gaseosas cola
    ]

    start_urls = [
        base_url.format(cat[c])
    ]

    def parse(self, response):
        item = ProductoGenericoItem()

        product_info = response.xpath("//li[@class='grilla-producto-container full-layout']").getall()
        for p in product_info:
            sel = Selector(text=p)

            item['repetido'] = False
            item['superMercado'] = 'Vea Argentina'
            item['sucursal'] = 'NO'
            item['marca'] = ''
            item['empresa'] = ''
            item['ean'] = ''
            item['sku'] = ''
            item['idArticulo'] = ''
            item['nombre'] = sel.xpath(
                "normalize-space(/html/body/li/div[2]/div/div[2]/div/div//text())"
            ).get()
            item['descripcion'] = ''
            precio = sel.xpath(
                "normalize-space(/html/body/li/div[2]/div/div[2]/div/div[2]/text())"
            ).get()
            centavos = sel.xpath(
                "normalize-space(/html/body/li/div[2]/div/div[2]/div/div[2]/span/text())"
            ).get()
            # Guard against None so a missing node doesn't raise a TypeError
            item['precio'] = (precio or '') + ',' + (centavos or '')
            item['precioPromocional'] = ''
            item['condicion'] = ''
            item['precioPorMedida'] = sel.xpath(
                "normalize-space(/html/body/li/div[2]/div/div[2]/div/div[3]/text())"
            ).get()
            item['stock'] = ''
            item['categoria'] = 'Bebidas'
            item['subcategoria'] = response.xpath(
                "normalize-space(//div[@class='category-breadcrumbs']/a//text())"
            ).get()
            item['segmento'] = response.xpath(
                "normalize-space(//span[@class='selected']//text())"
            ).get()
            item['imagen'] = sel.xpath(
                "/html/body/li/div[2]/div/div/img[1]/@src"
            ).get()
            item['promocion'] = sel.xpath(
                "normalize-space(/html/body/li/div/div/p)"
            ).get()
            # if 'Oferta' in item['promocion']:
            #     item['precioPromocional'] = item['promocion'].replace('Oferta', '')
            if item['segmento'] != '':
                item['posicionSegmento'] = self.pos
            else:
                item['posicionSubcategoria'] = self.pos

            self.pos += 1

            yield item


        if self.c < len(self.cat) - 1:
            self.c += 1
            self.pos = 1
            yield Request(
                self.base_url.format(self.cat[self.c]),
                callback=self.parse,
            )
        else:
            self.logger.info('finished')

Upvotes: 0

Views: 76

Answers (1)

Albert D. Kallal

Reputation: 49089

Well, you assume that only the URL parameters are used. Session state and internal logic could still exist in the code behind, and the session could contain the referring URL or the page it was called from.

I often have to redirect a page: while I might have some parameters, I also have some session vars set up and previous code that has to run. So if an incoming page is missing those internal session values? Then I redirect, since I need the previous page's code to run to load up information and required values.

In a way, this is not a lot different than desktop code. You might be on a customer page, and then hit "add invoice". Code will run to get and set up things like invoice payment terms and a gazillion other things, and THEN launch the actual form to enter the invoice. Such code patterns carry over to asp.net.

And then there is the simple issue of the referring URL. I have a user feedback page. It is one of the few places in the web site that allows non-logged-in users to enter stuff. But some spam bots were abusing this (they did not have to be logged in to use the feedback page).

So, now the feedback page's code behind checks the referring URL (the URL that launched the page). If the referring URL is not from my web site, then I redirect back to the main page. From the user's point of view, it just seems like the URL you entered did not work. So, often for reasons of security, one will check the referring URL, and if the page was not launched by the web site, then we know to reject the request.
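The kind of check described here can be sketched in a few lines (Python here rather than the actual asp.net code behind, just to illustrate the idea; the host name is made up):

```python
from urllib.parse import urlparse

SITE_HOST = "www.example.com"  # hypothetical host standing in for the real site

def referrer_is_trusted(referer_header):
    """Return True only if the Referer header points back at our own site.

    A missing or foreign Referer means the page was entered directly
    (or hit by a scraper), so the server would redirect to the home page.
    """
    if not referer_header:
        return False
    return urlparse(referer_header).netloc == SITE_HOST

# A request launched by clicking a link on the site passes:
assert referrer_is_trusted("https://www.example.com/Comprar/Home.aspx")
# A direct hit, or a scraper sending no Referer, is bounced:
assert not referrer_is_trusted(None)
assert not referrer_is_trusted("https://scraper.example.net/")
```

Note that the Referer header is supplied by the client, so this is a speed bump against casual scraping, not real security.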

This means that many of my URLs ONLY work if they were launched from my web site. If you try to enter the URL directly, or from a web scraper? Then the referring URL is not from my site anymore.

I thus redirect to previous pages to ensure that all kinds of setup code and values are correct before you actually get to the page in question.

I mean, for display of a project page? Well, the user has to search, and then find the project. Clicking on that project row will set up QUITE A LOT of stuff before we jump to the project viewing page and THEN display that one project.

In this case I use quite a few Session() variables as opposed to parameters in the URL. But it doesn't matter: the simple fact is I need a whole lot of things set up JUST right before you jump to that project URL. If you type in the project URL directly, I jump you back to the project selection page, since I need all that info set up before the page loads.

And often a mix of parameters and Session() is used, so JUST parameters in the URL will not work in a lot of cases. Really huge, scalable web sites (Amazon, Facebook, etc.) can't afford to use Session(), since it does not scale well on server farms (each web server can't rely on, say, in-memory sessions).

However, for a smaller web site? Then developers are far more free to use Session() to set up a page (internal values in code), and thus more often freely do so. The extra load and server requirements of keeping usable state outside of URL parameters can be accepted (and thus often are).

So, tons and tons of asp.net applications don't JUST use parameters in URLs. This is especially the case if they allow logged-on users: the code behind will hold values and information restricted to the one given user, so both URL parameters and internal Session() variables are required for the web site to work correctly.

So the smaller the site, and the less it is a huge scalable farm type of web site? The more freely developers use internal Session() values. This lets them write more complex business code with less effort (and not clutter up the URL with all kinds of ugly junk, I might add).

The other issue? In many cases, while parameters are used in the URL, I load up data beforehand, and thus ONLY parameters in the range for the one logged-on user will work. If I did not do this, then you could enter an ID or some parameter value that belongs to other users: a big security hole. In the early days, I recall one credit card company used your ID in the URL; if you typed in another ID, you could look at other people's credit card information! So this approach is less used, but MORE important, it means that often just parameters in the URL are not sufficient anymore.
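The fix for that hole is to check every incoming ID against what the session says the logged-on user may see. A minimal sketch, with a hypothetical session dict standing in for server-side session state:

```python
def can_view(session, requested_id):
    """Allow a record ID from the URL only if the session lists it.

    The server records the user's allowed IDs at login/search time,
    so typing someone else's ID into the URL simply gets rejected.
    """
    return requested_id in session.get("allowed_ids", set())

session = {"user": "naiky", "allowed_ids": {141446, 446126}}
assert can_view(session, 141446)       # own record: allowed
assert not can_view(session, 999999)   # someone else's ID: rejected
assert not can_view({}, 141446)        # no session at all: rejected
```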

So often code will simply check the referring URL, and this adds additional security to the web site. So your scrape code will have to launch the main page, and THEN jump to the page with the URL parameters. And it may have to click a button from that main page or the page before, since the code checks the referring URL, and it has to be from THEIR web site, not a URL typed in by you (or your scraper). You can't start at the page from scratch without having come from one of their web pages: the referring URL is checked.
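On the scraper side, that means fetching the main page first and presenting it as the Referer on the follow-up request. A stdlib sketch of building such a request (no network involved, just the header; the query-string form of the subcategory URL is hypothetical, since everything after `#` in the original URL is a fragment the browser never sends to the server; in Scrapy you would pass the same header via `Request(url, headers={...})`):

```python
from urllib.request import Request

MAIN_PAGE = "https://www.veadigital.com.ar/"
# Hypothetical query-string form of the subcategory URL:
SUBCAT = "https://www.veadigital.com.ar/Comprar/Home.aspx?_id=141446"

# Step 1 would be fetching MAIN_PAGE (to pick up cookies / session state);
# step 2 requests the subcategory while naming the main page as referrer.
req = Request(SUBCAT, headers={"Referer": MAIN_PAGE})

assert req.get_header("Referer") == MAIN_PAGE
```

Note this only covers the Referer part; if the site also relies on session state, the scraper additionally needs to carry the cookies from step 1 into step 2 (Scrapy does this automatically within one crawl).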

In your example, even if the 2nd page with parameters did work without having to hit the main page first? You still need a means to obtain the correct parameters, and I don't see how it is practical to guess or make up those parameters in the first place.

Upvotes: 1
